I have a dataframe that I would like to group by a categorical variable (4 values, let’s say a, b, c and d) and then by a continuous variable (values between 0 and 40). I’d like the continuous variable to be in groups too. Zero to five, five to ten, ten to fifteen, etc. Ideally I’d end up with 32 different dataframes by the end of this process. I know how to us groupby for categorical variables, but not grouping by continuous variables.
Thank you.
I believe that you’re conflating 2 steps. You need to first bin your continuous values into your desired bins and then use groupby, which simply groups by distinct values, regardless of whether the variables are discrete or continuous.
You could write your own binning function or perhaps use the fit function from here:
https://juliastats.org/StatsBase.jl/v0.33/empirical/#Histograms-1
Somebody else may give a better answer. For this specific problem, you may try something similar
# load libraries
using DataFrames, Statistics, Distributions
# create a demo DataFrame
df = DataFrame(a = repeat('a':'d', outer = 100), b = rand(Uniform(0,40), 400), c = randn(400))
# generate a group variable based on the continuous variable, subject to specific problem
df[:, :group] = map(x -> div(x, 5.0), df.b)
# create a groupby DataFrame based on variables a and group
gdf = groupby(df, [:a, :group])
# get some summary Statistics by group
combine(gdf, :c=>mean)
Also for binning:
using CategoricalArrays
?cut
using DataFrames
also works (as cut is reexported from CategoricalArrays)
Yes, cut is nice:
julia> by(transform(DataFrame(x=rand([:a,:b], 10^6), y=rand(10^6)), :y => (y->cut(y,3)) => :y2), [:x,:y2], :y => mean)
6×3 DataFrame
│ Row │ x │ y2 │ y_mean │
│ │ Symbol │ CategoricalValue… │ Float64 │
├────┼───────┼────────────────────────┼─────────┤
│ 1 │ b │ Q3: [0.666464, 0.999998] │ 0.833538 │
│ 2 │ b │ Q2: [0.333284, 0.666464) │ 0.499986 │
│ 3 │ a │ Q2: [0.333284, 0.666464) │ 0.499864 │
│ 4 │ a │ Q3: [0.666464, 0.999998] │ 0.833266 │
│ 5 │ a │ Q1: [1.59801e-6, 0.333284) │ 0.166748 │
│ 6 │ b │ Q1: [1.59801e-6, 0.333284) │ 0.167013 │
Using cut
is a good idea. But your method will not guarantee that it cuts at 5, 10, 15, etc.
A slightly modified version of creating a group variable using cut
is
df[:,:g] = cut(df.b, 0:5:40)
if all values in column b
falls into [0,40].
Thank you for all of the replies, this is very useful information.
Using cut seems the most straightforward, so this is what I’ve gone with. However, after using cut I am not able to then groupby on the new column. The new column has arrays, like [5,10) and [10,15). So it throws an error and won’t group. Any tips on how to deal with this part?
Thanks
Can you show the full error you get?
My code:
df[:,:g] = cut(df.Years, 0:5:40)
gdf = groupby(df, :g)
Error:
ERROR: LoadError: BoundsError: attempt to access 9-element Array{Int64,1} at index [2301742737]
An aside, I don’t know how to put code in the grey box thing, sorry.
There were missing values, I’ve figured it out now. Thank you for the help.
Hmm, that shouldn’t happen anyway. Would you be able to post code to reproduce the problem on my machine? What version of DataFrames are you using? Does the following code give an error on your machine (it doesn’t on mine)?
df = DataFrame(Years=replace(rand(0:40, 10_000), 40=>missing))
df[:,:g] = cut(df.Years, 0:5:40)
gdf = groupby(df, :g)
I have DataFrames v0.18.4.
Yes that code does give an error on my machine.
I’ve removed/replaced the missing values and everything works now. Why would it work on yours and not mine?
OK, I think I fixed this in CategoricalArrays 0.8. You should upgrade to DataFrames 0.21 when you can.