I would like to perform efficient block-wise statistics on binned variables… and I can’t find an elegant solution.
given:
using DataFrames
df = DataFrame(x = rand(100000), y = rand(100000), v = rand(100000))
I would like to take the mean (or other function) and count of v where x and y fall within the bins :
x_edge = 0:0.01:1
y_edge = 0:0.01:1
The only way I could figure out how to do this is using the Histogram function in StatsBase which seem really cumbersome
using StatsBase
df[!,:bin_ind] = CartesianIndex.(
StatsBase.binindex.(Ref(fit(Histogram, df.y, y_edge)), df.y),
StatsBase.binindex.(Ref(fit(Histogram, df.x, x_edge)), df.x))
foo = combine(groupby(df, :bin_ind), [:v => mean, :v => length])
Is there a better way to do this? I’m thinking something like a binindex function specific to DataFrames. This seem like it would be a common operation.
1 Like
Dan
January 19, 2023, 3:06am
2
In case the histogram edges are as convenient as in the post, something like this could work:
df = DataFrame(x = rand(100000), y = rand(100000), v = rand(100000))
transform!(df, [:x, :y] => ByRow((x,y)->(floor(Int,x*100), floor(Int,y*100))) => :cell)
foo = combine(groupby(df, :cell), nrow, :v => mean)
# use of `nrow` as @bkamins suggested
In any case, it seems you are dropping the df.x and df.y into the histogram twice in the post (once in fit and the other time with binindex) which is a waste.
Use CategoricalArrays.jl in general:
julia> using CategoricalArrays
julia> df.x_bin = cut(df.x, x_edge);
julia> df.y_bin = cut(df.y, y_edge);
julia> using Statistics
julia> combine(groupby(df, :x_bin), nrow, :v => mean)
100×3 DataFrame
Row │ x_bin nrow v_mean
│ Cat… Int64 Float64
─────┼───────────────────────────────
1 │ [0.0, 0.01) 1041 0.50743
2 │ [0.01, 0.02) 1026 0.497069
3 │ [0.02, 0.03) 964 0.488265
4 │ [0.03, 0.04) 947 0.497805
⋮ │ ⋮ ⋮ ⋮
98 │ [0.97, 0.98) 961 0.503324
99 │ [0.98, 0.99) 996 0.507203
100 │ [0.99, 1.0) 1010 0.492615
93 rows omitted
julia> combine(groupby(df, :y_bin), nrow, :v => mean)
100×3 DataFrame
Row │ y_bin nrow v_mean
│ Cat… Int64 Float64
─────┼───────────────────────────────
1 │ [0.0, 0.01) 1008 0.496496
2 │ [0.01, 0.02) 1024 0.491395
3 │ [0.02, 0.03) 981 0.502833
4 │ [0.03, 0.04) 1014 0.491912
⋮ │ ⋮ ⋮ ⋮
98 │ [0.97, 0.98) 1028 0.491234
99 │ [0.98, 0.99) 1054 0.496577
100 │ [0.99, 1.0) 943 0.503246
93 rows omitted
julia> combine(groupby(df, [:x_bin, :y_bin]), nrow, :v => mean)
10000×4 DataFrame
Row │ x_bin y_bin nrow v_mean
│ Cat… Cat… Int64 Float64
───────┼────────────────────────────────────────────
1 │ [0.0, 0.01) [0.0, 0.01) 10 0.509982
2 │ [0.0, 0.01) [0.01, 0.02) 14 0.650093
3 │ [0.0, 0.01) [0.02, 0.03) 8 0.58933
4 │ [0.0, 0.01) [0.03, 0.04) 12 0.511574
⋮ │ ⋮ ⋮ ⋮ ⋮
9998 │ [0.99, 1.0) [0.97, 0.98) 10 0.491388
9999 │ [0.99, 1.0) [0.98, 0.99) 10 0.319291
10000 │ [0.99, 1.0) [0.99, 1.0) 15 0.518321
9993 rows omitted
5 Likes
Perfect, thank you! In all my searching I hadn’t yet run across CategoricalArrays.jl
In case people want to further process the categorial data, e.g. for plotting, this can be done, e.g. via
fmt(from, to, i; leftclosed, rightclosed) = (from + to)/2
df.x_bin = cut(df.x, x_edge, labels=fmt) # labels are midpoints of intervals now
unwrap.(df.x_bin)