Dynamically split DataFrame into bins

fbanning · October 9, 2020, 4:01pm

I suppose this is a simple thing to do but I don’t quite know what to search for.
I’m currently wondering how I can dynamically split a DataFrame into equal-sized bins.

using DataFrames, StatsPlots
df = DataFrame(a = rand(100))
bins = [df.a[1:20], df.a[21:40], df.a[41:60], df.a[61:80], df.a[81:100]]
boxplot(1:length(bins), bins)

Now this approach does two things I dislike:

It creates an array instead of grouping the df. I’d actually prefer to simply create a GroupedDataFrame via some groupby command (which is just a view into the df and doesn’t allocate anything) and then use the resulting subdf in each group for plotting.
It is not dynamic, meaning that I don’t have a way to just say “split df into 5 equal groups based on row number”. I know similar things are possible with cut from CategoricalArrays.jl (e.g. in the above example I could just do cut(df.a, 5)), but that assigns bins by value rather than by row position. I’m not looking for values sorted into a given number of bins; I want to keep the existing row order of the df.

Most likely what I’m asking for can be done very easily and it’s just that I can’t see the forest for the trees.

mthelm85 · October 9, 2020, 4:18pm

This might be a bit hacky, but it gets the job done (I think):

using DataFrames

df = DataFrame(a = rand(100))

df.group = repeat(1:5, inner=20)

julia> groupby(df, :group)
GroupedDataFrame with 5 groups based on key: group
First Group (20 rows): group = 1
│ Row │ a        │ group │
│     │ Float64  │ Int64 │
├─────┼──────────┼───────┤
│ 1   │ 0.362288 │ 1     │
│ 2   │ 0.461975 │ 1     │
│ 3   │ 0.179701 │ 1     │
│ 4   │ 0.992212 │ 1     │
│ 5   │ 0.428374 │ 1     │
│ 6   │ 0.967794 │ 1     │
⋮
│ 14  │ 0.24395  │ 1     │
│ 15  │ 0.812229 │ 1     │
│ 16  │ 0.909791 │ 1     │
│ 17  │ 0.151844 │ 1     │
│ 18  │ 0.33243  │ 1     │
│ 19  │ 0.922681 │ 1     │
│ 20  │ 0.818121 │ 1     │
⋮
Last Group (20 rows): group = 5
│ Row │ a          │ group │
│     │ Float64    │ Int64 │
├─────┼────────────┼───────┤
│ 1   │ 0.432551   │ 5     │
│ 2   │ 0.639232   │ 5     │
│ 3   │ 0.00861674 │ 5     │
│ 4   │ 0.703537   │ 5     │
│ 5   │ 0.0616792  │ 5     │
│ 6   │ 0.615816   │ 5     │
⋮
│ 14  │ 0.352876   │ 5     │
│ 15  │ 0.920175   │ 5     │
│ 16  │ 0.388247   │ 5     │
│ 17  │ 0.738989   │ 5     │
│ 18  │ 0.473766   │ 5     │
│ 19  │ 0.731183   │ 5     │
│ 20  │ 0.613228   │ 5     │
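
The group column above is hard-coded for 100 rows and 5 groups. A minimal sketch of how the labels could be computed for any row count follows; the helper name bin_labels is hypothetical (it is not part of DataFrames.jl), and it assumes that near-equal bin sizes are acceptable when the row count is not divisible by the number of groups.

```julia
# Hypothetical helper (not from the thread): label rows 1:n with k
# contiguous, near-equal bins while keeping the original row order.
function bin_labels(n::Integer, k::Integer)
    1 <= k <= n || throw(ArgumentError("need 1 <= k <= n"))
    # Integer ceiling division avoids floating-point rounding issues.
    return cld.((1:n) .* k, n)
end

labels = bin_labels(100, 5)   # same labels as repeat(1:5, inner=20)
uneven = bin_labels(10, 3)    # bin sizes 3, 3 and 4
```

With a data frame this could be used as df.group = bin_labels(nrow(df), 5), followed by groupby(df, :group) as shown above.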

pdeffebach · October 9, 2020, 4:23pm

Definitely just create a variable and group on that!

fbanning · October 12, 2020, 2:06pm

Thank you, this is a very nice and easy way to do it. I think it specifically makes sense to do it like this because it allows for column-wise operations which DataFrames is built on. Most likely I will apply such an approach as yours to tackle this. Thanks!

I wonder if @bkamins would be willing to chime in and provide some insights on whether or not this is the preferred way of doing it?

bkamins · October 12, 2020, 3:27pm

This is how I would do it.

An alternative using SplitApplyCombine.jl would be to materialize a vector of DataFrames:

julia> using SplitApplyCombine

julia> [df[i, :] for i in groupinds(repeat(1:5, inner=20))]

(you could alternatively create views)

The benefit of this approach is that you do not add a grouping column to the data frame (but, as said above, I would personally prefer to add this column and group on it).
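
As a rough sketch of the view-based variant mentioned above, the row ranges can also be produced with Iterators.partition from Base instead of SplitApplyCombine.jl. The variable names k, binsize and bins are illustrative assumptions. Because the bin size is rounded up, the last bin may be shorter, and for some row counts fewer than k bins result.

```julia
using DataFrames

df = DataFrame(a = rand(100))
k = 5  # desired number of bins (assumed value for illustration)

# Rows per bin, rounded up; the last bin may be shorter.
binsize = cld(nrow(df), k)

# Each element is a SubDataFrame view into df, so no rows are copied
# and no grouping column is added.
bins = [view(df, rows, :) for rows in Iterators.partition(1:nrow(df), binsize)]
```

Each element of bins keeps the original row order, so [b.a for b in bins] gives the same vectors as the hand-written slices in the first post.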
