Dynamically split DataFrame into bins

I suppose this is a simple thing to do but I don’t quite know what to search for.
I’m currently wondering how I can dynamically split a DataFrame into equal-sized bins.

using DataFrames, StatsPlots
df = DataFrame(a = rand(100))
bins = [df.a[1:20], df.a[21:40], df.a[41:60], df.a[61:80], df.a[81:100]]
boxplot(1:length(bins), bins)

Now this approach does two things I dislike:

  1. It creates an array instead of grouping the df. I’d prefer to create a GroupedDataFrame via some groupby command (which is just a view into the df and doesn’t copy the data) and then use the resulting subdf in each group for plotting.
  2. It is not dynamic: I have no way to just say “split df into 5 equal groups based on row number”. I know something similar is possible with cut from CategoricalArrays.jl (e.g. in the above example I could just do cut(df.a, 5)), but cut bins the rows by their values. I want bins of consecutive rows, so that the existing order of the df is retained.

Most likely what I’m asking for can be done very easily and it’s just that I can’t see the forest for the trees. :)

This might be a bit hacky, but it gets the job done (I think):

using DataFrames

df = DataFrame(a = rand(100))

df.group = repeat(1:5, inner=20)

julia> groupby(df, :group)
GroupedDataFrame with 5 groups based on key: group
First Group (20 rows): group = 1
│ Row │ a        │ group │
│     │ Float64  │ Int64 │
├─────┼──────────┼───────┤
│ 1   │ 0.362288 │ 1     │
│ 2   │ 0.461975 │ 1     │
│ 3   │ 0.179701 │ 1     │
│ 4   │ 0.992212 │ 1     │
│ 5   │ 0.428374 │ 1     │
│ 6   │ 0.967794 │ 1     │
⋮
│ 14  │ 0.24395  │ 1     │
│ 15  │ 0.812229 │ 1     │
│ 16  │ 0.909791 │ 1     │
│ 17  │ 0.151844 │ 1     │
│ 18  │ 0.33243  │ 1     │
│ 19  │ 0.922681 │ 1     │
│ 20  │ 0.818121 │ 1     │
⋮
Last Group (20 rows): group = 5
│ Row │ a          │ group │
│     │ Float64    │ Int64 │
├─────┼────────────┼───────┤
│ 1   │ 0.432551   │ 5     │
│ 2   │ 0.639232   │ 5     │
│ 3   │ 0.00861674 │ 5     │
│ 4   │ 0.703537   │ 5     │
│ 5   │ 0.0616792  │ 5     │
│ 6   │ 0.615816   │ 5     │
⋮
│ 14  │ 0.352876   │ 5     │
│ 15  │ 0.920175   │ 5     │
│ 16  │ 0.388247   │ 5     │
│ 17  │ 0.738989   │ 5     │
│ 18  │ 0.473766   │ 5     │
│ 19  │ 0.731183   │ 5     │
│ 20  │ 0.613228   │ 5     │
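
The inner=20 above is hard-coded for 100 rows and 5 groups. A minimal sketch of a dynamic version follows, assuming the number of groups is at most the number of rows; rowbins is a hypothetical helper name, not something DataFrames.jl provides:

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): label each of `n` rows,
# in row order, with one of `k` contiguous groups of near-equal size.
# Assumes 1 <= k <= n; group sizes differ by at most one row.
rowbins(n::Integer, k::Integer) = [cld(i * k, n) for i in 1:n]

df = DataFrame(a = rand(100))
df.group = rowbins(nrow(df), 5)   # here identical to repeat(1:5, inner=20)
gdf = groupby(df, :group)         # groups follow the original row order
```

With StatsPlots loaded, something like @df df boxplot(:group, :a) should then draw one box per group, though that call is untested here.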

Definitely just create a variable and group on that!


Thank you, this is a very nice and easy way to do it. It makes particular sense because it allows for the column-wise operations that DataFrames is built on. I will most likely use this approach. Thanks!

I wonder if @bkamins would be willing to chime in and provide some insights on whether or not this is the preferred way of doing it?

This is how I would do it.

An alternative using SplitApplyCombine.jl would be to materialize a vector of DataFrames:

julia> using SplitApplyCombine

julia> [df[i, :] for i in groupinds(repeat(1:5, inner=20))]

(you could alternatively create views)

The benefit of this approach is that you do not add a grouping column to the data frame (but as said above - I would personally prefer to have this column added and group on it).
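
For the views variant, here is a rough sketch that uses only Iterators.partition from Base rather than SplitApplyCombine.jl. The names nbins, binsize and bins are chosen just for illustration:

```julia
using DataFrames

df = DataFrame(a = rand(100))
nbins = 5                        # assumed number of bins
binsize = cld(nrow(df), nbins)   # rows per bin; the last bin may be shorter

# Iterators.partition yields consecutive index ranges of length `binsize`;
# view(df, rows, :) wraps each range in a SubDataFrame without copying rows.
bins = [view(df, rows, :) for rows in Iterators.partition(1:nrow(df), binsize)]
```

Each element of bins is a SubDataFrame backed by df, so no grouping column is added. Note that when the row count does not divide evenly, this sketch can return fewer than nbins bins.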
