# DataFrames range binning

I would like to perform efficient block-wise statistics on binned variables… and I can’t find an elegant solution.

given:

``````using DataFrames
df = DataFrame(x = rand(100000), y = rand(100000), v = rand(100000))
``````

I would like to take the mean (or other function) and count of `v` where `x` and `y` fall within the bins :

``````x_edge = 0:0.01:1
y_edge = 0:0.01:1
``````

The only way I could figure out how to do this is using the `Histogram` function in `StatsBase` which seem really cumbersome

``````using StatsBase
df[!,:bin_ind] = CartesianIndex.(
StatsBase.binindex.(Ref(fit(Histogram, df.y, y_edge)), df.y),
StatsBase.binindex.(Ref(fit(Histogram, df.x, x_edge)), df.x))

foo = combine(groupby(df, :bin_ind), [:v => mean, :v => length])
``````

Is there a better way to do this? I’m thinking something like a `binindex` function specific to `DataFrames`. This seem like it would be a common operation.

In case the histogram edges are as convenient as in the post, something like this could work:

``````df = DataFrame(x = rand(100000), y = rand(100000), v = rand(100000))
transform!(df, [:x, :y] => ByRow((x,y)->(floor(Int,x*100), floor(Int,y*100))) => :cell)
foo = combine(groupby(df, :cell), nrow, :v => mean)
# use of `nrow` as @bkamins suggested
``````

In any case, it seems you are dropping the `df.x` and `df.y` into the histogram twice in the post (once in `fit` and the other time with `binindex`) which is a waste.

Use CategoricalArrays.jl in general:

``````julia> using CategoricalArrays

julia> df.x_bin = cut(df.x, x_edge);

julia> df.y_bin = cut(df.y, y_edge);

julia> using Statistics

julia> combine(groupby(df, :x_bin), nrow, :v => mean)
100×3 DataFrame
Row │ x_bin         nrow   v_mean
│ Cat…          Int64  Float64
─────┼───────────────────────────────
1 │ [0.0, 0.01)    1041  0.50743
2 │ [0.01, 0.02)   1026  0.497069
3 │ [0.02, 0.03)    964  0.488265
4 │ [0.03, 0.04)    947  0.497805
⋮  │      ⋮          ⋮       ⋮
98 │ [0.97, 0.98)    961  0.503324
99 │ [0.98, 0.99)    996  0.507203
100 │ [0.99, 1.0)    1010  0.492615
93 rows omitted

julia> combine(groupby(df, :y_bin), nrow, :v => mean)
100×3 DataFrame
Row │ y_bin         nrow   v_mean
│ Cat…          Int64  Float64
─────┼───────────────────────────────
1 │ [0.0, 0.01)    1008  0.496496
2 │ [0.01, 0.02)   1024  0.491395
3 │ [0.02, 0.03)    981  0.502833
4 │ [0.03, 0.04)   1014  0.491912
⋮  │      ⋮          ⋮       ⋮
98 │ [0.97, 0.98)   1028  0.491234
99 │ [0.98, 0.99)   1054  0.496577
100 │ [0.99, 1.0)     943  0.503246
93 rows omitted

julia> combine(groupby(df, [:x_bin, :y_bin]), nrow, :v => mean)
10000×4 DataFrame
Row │ x_bin        y_bin         nrow   v_mean
│ Cat…         Cat…          Int64  Float64
───────┼────────────────────────────────────────────
1 │ [0.0, 0.01)  [0.0, 0.01)      10  0.509982
2 │ [0.0, 0.01)  [0.01, 0.02)     14  0.650093
3 │ [0.0, 0.01)  [0.02, 0.03)      8  0.58933
4 │ [0.0, 0.01)  [0.03, 0.04)     12  0.511574
⋮   │      ⋮            ⋮          ⋮       ⋮
9998 │ [0.99, 1.0)  [0.97, 0.98)     10  0.491388
9999 │ [0.99, 1.0)  [0.98, 0.99)     10  0.319291
10000 │ [0.99, 1.0)  [0.99, 1.0)      15  0.518321
9993 rows omitted
``````
1 Like

Perfect, thank you! In all my searching I hadn’t yet run across CategoricalArrays.jl