Group DataFrames by a function of a column

robsmith11 · December 11, 2019, 9:45pm

I’m starting to use DataFrames more frequently and find myself often having to create a temporary table and column just to get a transformed column by which to group.

For example:

julia> d=DataFrame(x=rand(10^6));

julia> d[!,:g] = cut(d.x, 4);

julia> by(d, :g, :x => mean)
4×2 DataFrame
│ Row │ g                     │ x_mean   │
│     │ Categorical…          │ Float64  │
├─────┼───────────────────────┼──────────┤
│ 1   │ [2.55266e-7, 0.24991) │ 0.124957 │
│ 2   │ [0.24991, 0.500171)   │ 0.375074 │
│ 3   │ [0.500171, 0.750146)  │ 0.62502  │
│ 4   │ [0.750146, 0.999997]  │ 0.875144 │

I often need to create binned statistics like this at the end of a chain of table transformations, so I’d prefer to do everything in one line without the temp tables. Is there any way to achieve something like the following syntax? Is it reasonable to make a feature request to support it?

julia> by(d, cut(:x, 4), :x => mean)

This syntax would be similar to what is possible in kdb+/q:

select avg x by 4 xrank x from d

pdeffebach · December 11, 2019, 10:07pm

This isn’t possible at the moment in DataFrames. You can file an issue asking for something that would make this easier, but note that your function always has to be evaluated, so unless cut returns only an iterator and not a vector, I would just write a small wrapper function to do this for you.
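A minimal sketch of such a wrapper, for illustration only. It assumes a DataFrames.jl version that provides transform, groupby and combine (0.21 or later, where by was deprecated in favour of groupby plus combine). The name groupby_fn and the groupname keyword are hypothetical, not part of DataFrames.

```julia
using DataFrames, Statistics

# Hypothetical helper (not part of DataFrames.jl):
# group `df` by `f` applied to the whole column `col`, then aggregate `col` with `agg`.
function groupby_fn(df::AbstractDataFrame, col::Symbol, f, agg; groupname::Symbol = :g)
    # Add the derived grouping column without copying the existing columns.
    keyed = transform(df, col => f => groupname; copycols = false)
    # Aggregate `col` within each group; the result column is named e.g. `x_mean`.
    return combine(groupby(keyed, groupname), col => agg)
end

# Usage with a plain function as the grouping key:
d = DataFrame(x = [0.1, 0.2, 0.8, 0.9])
res = groupby_fn(d, :x, x -> x .> 0.5, mean)
```

With CategoricalArrays loaded, the call from the question would look like groupby_fn(d, :x, x -> cut(x, 4), mean). The grouping column is still materialised internally, but the caller's table is left untouched.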

xiaodai · December 11, 2019, 10:42pm

This syntax would only be possible in a macro. Perhaps raise it with DataFramesMeta.jl.

xiaodai · December 11, 2019, 11:08pm

Another approach might be to use a vector-based interface, e.g. FastGroupBy.jl’s, together with the @df macro. E.g.

using DataFrames, Statistics

d = DataFrame(x = rand(1000))

using FastGroupBy, StatsPlots, CategoricalArrays

res = @df d fastby(mean, cut(:x, 4), :x) |> DataFrame

Just realised I left the “woohoo” in my code; I need to fix that ASAP.
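To illustrate what a vector-based interface means here, the following is a rough sketch in plain Julia. It is not how FastGroupBy.jl is implemented, and the name group_reduce is hypothetical; it only shows the idea of passing a vector of keys and a vector of values instead of a table.

```julia
using Statistics

# Hypothetical helper: apply `agg` to the values in `vals` that share a key in `ks`.
function group_reduce(agg, ks::AbstractVector, vals::AbstractVector)
    length(ks) == length(vals) ||
        throw(ArgumentError("ks and vals must have the same length"))
    groups = Dict{eltype(ks), Vector{eltype(vals)}}()
    for (k, v) in zip(ks, vals)
        # Create an empty bucket the first time a key is seen, then append the value.
        push!(get!(() -> eltype(vals)[], groups, k), v)
    end
    # One aggregated value per distinct key.
    return Dict(k => agg(vs) for (k, vs) in groups)
end

out = group_reduce(mean, [1, 1, 2], [1.0, 3.0, 5.0])
```

Because the keys are just a vector, any transformed column (for example the result of cut) can be passed directly without first adding it to the table.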

anon92994695 · December 11, 2019, 11:29pm

It’s more than reasonable to request this as a feature.

Hmmm… I was doing this maybe 6-7 months ago through some sort of weird workaround that didn’t use cut… Maybe I made my own iterator? I remember it not being too hard, but from grokking the latest code, things look different from what I remember… Maybe take a crack at making one?

https://github.com/JuliaData/DataFrames.jl/blob/025824f80e720693b7e21ac49d8ae64c8830ce98/src/groupeddataframe/grouping.jl#L165
