DataFrames range binning

alex-s-gardner · January 19, 2023, 2:42am

I would like to perform efficient block-wise statistics on binned variables… and I can’t find an elegant solution.

given:

using DataFrames
df = DataFrame(x = rand(100000), y = rand(100000), v = rand(100000))

I would like to take the mean (or other function) and count of v where x and y fall within the bins :

x_edge = 0:0.01:1
y_edge = 0:0.01:1

The only way I could figure out how to do this is using the Histogram function in StatsBase which seem really cumbersome

using StatsBase
df[!,:bin_ind] = CartesianIndex.(
        StatsBase.binindex.(Ref(fit(Histogram, df.y, y_edge)), df.y), 
        StatsBase.binindex.(Ref(fit(Histogram, df.x, x_edge)), df.x))

 foo = combine(groupby(df, :bin_ind), [:v => mean, :v => length])

Is there a better way to do this? I’m thinking something like a binindex function specific to DataFrames. This seem like it would be a common operation.

Dan · January 19, 2023, 3:06am

In case the histogram edges are as convenient as in the post, something like this could work:

df = DataFrame(x = rand(100000), y = rand(100000), v = rand(100000))
transform!(df, [:x, :y] => ByRow((x,y)->(floor(Int,x*100), floor(Int,y*100))) => :cell)
foo = combine(groupby(df, :cell), nrow, :v => mean)
# use of `nrow` as @bkamins suggested

In any case, it seems you are dropping the df.x and df.y into the histogram twice in the post (once in fit and the other time with binindex) which is a waste.

bkamins · January 19, 2023, 3:07am

Use CategoricalArrays.jl in general:

julia> using CategoricalArrays

julia> df.x_bin = cut(df.x, x_edge);

julia> df.y_bin = cut(df.y, y_edge);

julia> using Statistics

julia> combine(groupby(df, :x_bin), nrow, :v => mean)
100×3 DataFrame
 Row │ x_bin         nrow   v_mean
     │ Cat…          Int64  Float64
─────┼───────────────────────────────
   1 │ [0.0, 0.01)    1041  0.50743
   2 │ [0.01, 0.02)   1026  0.497069
   3 │ [0.02, 0.03)    964  0.488265
   4 │ [0.03, 0.04)    947  0.497805
  ⋮  │      ⋮          ⋮       ⋮
  98 │ [0.97, 0.98)    961  0.503324
  99 │ [0.98, 0.99)    996  0.507203
 100 │ [0.99, 1.0)    1010  0.492615
                      93 rows omitted

julia> combine(groupby(df, :y_bin), nrow, :v => mean)
100×3 DataFrame
 Row │ y_bin         nrow   v_mean
     │ Cat…          Int64  Float64
─────┼───────────────────────────────
   1 │ [0.0, 0.01)    1008  0.496496
   2 │ [0.01, 0.02)   1024  0.491395
   3 │ [0.02, 0.03)    981  0.502833
   4 │ [0.03, 0.04)   1014  0.491912
  ⋮  │      ⋮          ⋮       ⋮
  98 │ [0.97, 0.98)   1028  0.491234
  99 │ [0.98, 0.99)   1054  0.496577
 100 │ [0.99, 1.0)     943  0.503246
                      93 rows omitted

julia> combine(groupby(df, [:x_bin, :y_bin]), nrow, :v => mean)
10000×4 DataFrame
   Row │ x_bin        y_bin         nrow   v_mean
       │ Cat…         Cat…          Int64  Float64
───────┼────────────────────────────────────────────
     1 │ [0.0, 0.01)  [0.0, 0.01)      10  0.509982
     2 │ [0.0, 0.01)  [0.01, 0.02)     14  0.650093
     3 │ [0.0, 0.01)  [0.02, 0.03)      8  0.58933
     4 │ [0.0, 0.01)  [0.03, 0.04)     12  0.511574
   ⋮   │      ⋮            ⋮          ⋮       ⋮
  9998 │ [0.99, 1.0)  [0.97, 0.98)     10  0.491388
  9999 │ [0.99, 1.0)  [0.98, 0.99)     10  0.319291
 10000 │ [0.99, 1.0)  [0.99, 1.0)      15  0.518321
                                   9993 rows omitted

alex-s-gardner · January 19, 2023, 3:28am

Perfect, thank you! In all my searching I hadn’t yet run across CategoricalArrays.jl

jamblejoe · August 8, 2024, 4:07pm

In case people want to further process the categorial data, e.g. for plotting, this can be done, e.g. via

fmt(from, to, i; leftclosed, rightclosed) = (from + to)/2
df.x_bin = cut(df.x, x_edge, labels=fmt) # labels are midpoints of intervals now
unwrap.(df.x_bin)