Using the groupby function

Chris_Anderson · June 5, 2020, 11:38pm

I have a dataframe that I would like to group by a categorical variable (4 values, let’s say a, b, c and d) and then by a continuous variable (values between 0 and 40). I’d like the continuous variable to be in groups too. Zero to five, five to ten, ten to fifteen, etc. Ideally I’d end up with 32 different dataframes by the end of this process. I know how to us groupby for categorical variables, but not grouping by continuous variables.
Thank you.

robsmith11 · June 5, 2020, 11:54pm

I believe that you’re conflating 2 steps. You need to first bin your continuous values into your desired bins and then use groupby, which simply groups by distinct values, regardless of whether the variables are discrete or continuous.

You could write your own binning function or perhaps use the fit function from here:
https://juliastats.org/StatsBase.jl/v0.33/empirical/#Histograms-1

jack_rabbit · June 6, 2020, 12:12am

Somebody else may give a better answer. For this specific problem, you may try something similar

# load libraries
using DataFrames, Statistics, Distributions

# create a demo DataFrame
df = DataFrame(a = repeat('a':'d', outer = 100), b = rand(Uniform(0,40), 400),  c = randn(400))
# generate a group variable based on the continuous variable, subject to specific problem
df[:, :group] = map(x -> div(x, 5.0), df.b)
# create a groupby DataFrame based on variables a and group
gdf = groupby(df, [:a, :group])
# get some summary Statistics by group
combine(gdf, :c=>mean)

Mattriks · June 6, 2020, 12:36am

Also for binning:

using CategoricalArrays
?cut

using DataFrames also works (as cut is reexported from CategoricalArrays)

robsmith11 · June 6, 2020, 7:07am

Yes, cut is nice:

julia> by(transform(DataFrame(x=rand([:a,:b], 10^6), y=rand(10^6)), :y => (y->cut(y,3)) => :y2), [:x,:y2], :y => mean)
6×3 DataFrame
│ Row │ x      │ y2                         │ y_mean   │
│     │ Symbol │ CategoricalValue…          │ Float64  │
├────┼───────┼────────────────────────┼─────────┤
│ 1   │ b      │ Q3: [0.666464, 0.999998]   │ 0.833538 │
│ 2   │ b      │ Q2: [0.333284, 0.666464)   │ 0.499986 │
│ 3   │ a      │ Q2: [0.333284, 0.666464)   │ 0.499864 │
│ 4   │ a      │ Q3: [0.666464, 0.999998]   │ 0.833266 │
│ 5   │ a      │ Q1: [1.59801e-6, 0.333284) │ 0.166748 │
│ 6   │ b      │ Q1: [1.59801e-6, 0.333284) │ 0.167013 │

jack_rabbit · June 6, 2020, 1:01pm

Using cut is a good idea. But your method will not guarantee that it cuts at 5, 10, 15, etc.
A slightly modified version of creating a group variable using cut is

df[:,:g] = cut(df.b, 0:5:40)

if all values in column b falls into [0,40].

Chris_Anderson · June 6, 2020, 2:51pm

Thank you for all of the replies, this is very useful information.

Using cut seems the most straightforward, so this is what I’ve gone with. However, after using cut I am not able to then groupby on the new column. The new column has arrays, like [5,10) and [10,15). So it throws an error and won’t group. Any tips on how to deal with this part?

Thanks

nalimilan · June 6, 2020, 3:02pm

Can you show the full error you get?

Chris_Anderson · June 6, 2020, 3:15pm

My code:
df[:,:g] = cut(df.Years, 0:5:40)
gdf = groupby(df, :g)

Error:
ERROR: LoadError: BoundsError: attempt to access 9-element Array{Int64,1} at index [2301742737]

An aside, I don’t know how to put code in the grey box thing, sorry.

Chris_Anderson · June 6, 2020, 3:28pm

There were missing values, I’ve figured it out now. Thank you for the help.

nalimilan · June 6, 2020, 3:40pm

Hmm, that shouldn’t happen anyway. Would you be able to post code to reproduce the problem on my machine? What version of DataFrames are you using? Does the following code give an error on your machine (it doesn’t on mine)?

df = DataFrame(Years=replace(rand(0:40, 10_000), 40=>missing))
df[:,:g] = cut(df.Years, 0:5:40)
gdf = groupby(df, :g)

Chris_Anderson · June 6, 2020, 3:52pm

I have DataFrames v0.18.4.
Yes that code does give an error on my machine.
I’ve removed/replaced the missing values and everything works now. Why would it work on yours and not mine?

nalimilan · June 6, 2020, 4:12pm

OK, I think I fixed this in CategoricalArrays 0.8. You should upgrade to DataFrames 0.21 when you can.

Topic		Replies	Views
Best way to bin data from dataframe? New to Julia	3	932	August 6, 2019
Equivalent to Pandas "cut" in Julia DataFrames? New to Julia question , dataframes	4	862	August 21, 2023
DataFrames range binning Data	4	1020	August 8, 2024
Group DataFrames by a function of a column Data package	4	1206	December 11, 2019
Discretize/Binning of Continuous Variable in Dataframe New to Julia	2	718	August 9, 2021

Using the groupby function

Related topics