I would like to perform efficient block-wise statistics on binned variables… and I can’t find an elegant solution.
given:
using DataFrames
df = DataFrame(x = rand(100000), y = rand(100000), v = rand(100000))
I would like to take the mean (or other function) and count of v
where x
and y
fall within the bins :
x_edge = 0:0.01:1
y_edge = 0:0.01:1
The only way I could figure out how to do this is using the Histogram
function in StatsBase
which seem really cumbersome
using StatsBase
df[!,:bin_ind] = CartesianIndex.(
StatsBase.binindex.(Ref(fit(Histogram, df.y, y_edge)), df.y),
StatsBase.binindex.(Ref(fit(Histogram, df.x, x_edge)), df.x))
foo = combine(groupby(df, :bin_ind), [:v => mean, :v => length])
Is there a better way to do this? I’m thinking something like a binindex
function specific to DataFrames
. This seem like it would be a common operation.
1 Like
Dan
January 19, 2023, 3:06am
2
In case the histogram edges are as convenient as in the post, something like this could work:
df = DataFrame(x = rand(100000), y = rand(100000), v = rand(100000))
transform!(df, [:x, :y] => ByRow((x,y)->(floor(Int,x*100), floor(Int,y*100))) => :cell)
foo = combine(groupby(df, :cell), nrow, :v => mean)
# use of `nrow` as @bkamins suggested
In any case, it seems you are dropping the df.x
and df.y
into the histogram twice in the post (once in fit
and the other time with binindex
) which is a waste.
Use CategoricalArrays.jl in general:
julia> using CategoricalArrays
julia> df.x_bin = cut(df.x, x_edge);
julia> df.y_bin = cut(df.y, y_edge);
julia> using Statistics
julia> combine(groupby(df, :x_bin), nrow, :v => mean)
100×3 DataFrame
Row │ x_bin nrow v_mean
│ Cat… Int64 Float64
─────┼───────────────────────────────
1 │ [0.0, 0.01) 1041 0.50743
2 │ [0.01, 0.02) 1026 0.497069
3 │ [0.02, 0.03) 964 0.488265
4 │ [0.03, 0.04) 947 0.497805
⋮ │ ⋮ ⋮ ⋮
98 │ [0.97, 0.98) 961 0.503324
99 │ [0.98, 0.99) 996 0.507203
100 │ [0.99, 1.0) 1010 0.492615
93 rows omitted
julia> combine(groupby(df, :y_bin), nrow, :v => mean)
100×3 DataFrame
Row │ y_bin nrow v_mean
│ Cat… Int64 Float64
─────┼───────────────────────────────
1 │ [0.0, 0.01) 1008 0.496496
2 │ [0.01, 0.02) 1024 0.491395
3 │ [0.02, 0.03) 981 0.502833
4 │ [0.03, 0.04) 1014 0.491912
⋮ │ ⋮ ⋮ ⋮
98 │ [0.97, 0.98) 1028 0.491234
99 │ [0.98, 0.99) 1054 0.496577
100 │ [0.99, 1.0) 943 0.503246
93 rows omitted
julia> combine(groupby(df, [:x_bin, :y_bin]), nrow, :v => mean)
10000×4 DataFrame
Row │ x_bin y_bin nrow v_mean
│ Cat… Cat… Int64 Float64
───────┼────────────────────────────────────────────
1 │ [0.0, 0.01) [0.0, 0.01) 10 0.509982
2 │ [0.0, 0.01) [0.01, 0.02) 14 0.650093
3 │ [0.0, 0.01) [0.02, 0.03) 8 0.58933
4 │ [0.0, 0.01) [0.03, 0.04) 12 0.511574
⋮ │ ⋮ ⋮ ⋮ ⋮
9998 │ [0.99, 1.0) [0.97, 0.98) 10 0.491388
9999 │ [0.99, 1.0) [0.98, 0.99) 10 0.319291
10000 │ [0.99, 1.0) [0.99, 1.0) 15 0.518321
9993 rows omitted
5 Likes
Perfect, thank you! In all my searching I hadn’t yet run across CategoricalArrays.jl
In case people want to further process the categorial data, e.g. for plotting, this can be done, e.g. via
fmt(from, to, i; leftclosed, rightclosed) = (from + to)/2
df.x_bin = cut(df.x, x_edge, labels=fmt) # labels are midpoints of intervals now
unwrap.(df.x_bin)