Group DataFrames by a function of a column

I’m starting to use DataFrames more frequently and find myself often having to create a temporary table and column just to get a transformed column by which to group.

For example:

julia> d=DataFrame(x=rand(10^6));

julia> d[!,:g] = cut(d.x, 4);

julia> by(d, :g, :x => mean)
4×2 DataFrame
│ Row │ g                     │ x_mean   │
│     │ Categorical…          │ Float64  │
├─────┼───────────────────────┼──────────┤
│ 1   │ [2.55266e-7, 0.24991) │ 0.124957 │
│ 2   │ [0.24991, 0.500171)   │ 0.375074 │
│ 3   │ [0.500171, 0.750146)  │ 0.62502  │
│ 4   │ [0.750146, 0.999997]  │ 0.875144 │

I often need to create binned statistics like this at the end of a chain of table transformations, so I’d prefer to be able to do everything in one line without the temp tables. Is there any way to achieve something like the following syntax? Is it reasonable to make a feature request to support it?

julia> by(d, cut(:x, 4), :x => mean)

This syntax would be similar to what is possible in kdb+/q:

select avg x by 4 xrank x from d

This isn’t possible at the moment in DataFrames. You can file an issue asking for this to be made easier, but note that your function always has to be evaluated, so unless cut returns a lazy iterator rather than a vector, I would just write a small wrapper function to do this for you.
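A minimal sketch of such a wrapper, assuming a recent DataFrames.jl in which `groupby` plus `combine` take the place of `by`. The name `by_transformed` and its `name` keyword are hypothetical, not part of any package:

```julia
using DataFrames, CategoricalArrays, Statistics

# Hypothetical helper (not part of DataFrames.jl): group `df` by `f` applied to
# the whole column `col`, then apply the aggregation pairs in `aggs` per group.
function by_transformed(df::AbstractDataFrame, col::Symbol, f, aggs...; name::Symbol = :g)
    # `transform` returns a new data frame, so `df` itself gets no extra column.
    # The grouping vector is still fully materialized here.
    tmp = transform(df, col => f => name)
    return combine(groupby(tmp, name), aggs...)
end

d = DataFrame(x = rand(10^6))
res = by_transformed(d, :x, x -> cut(x, 4), :x => mean)
```

The temporary column still exists inside the helper, but the call site stays a single expression, so it can sit at the end of a chain of transformations.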

This syntax would only be possible in a macro. Perhaps raise it with DataFramesMeta.jl.

Another approach might be to use a vector-based interface, e.g. FastGroupBy.jl’s fastby, together with the @df macro:

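using DataFrames, Statistics  # Statistics provides mean

d = DataFrame(x = rand(1000))

using FastGroupBy, StatsPlots, CategoricalArrays  # fastby, @df, cut

# Group x by its quartile bin and take the mean of each group
res = @df d fastby(mean, cut(:x, 4), :x) |> DataFrame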

Just realised I left the “woohoo” in my code :frowning: need to fix ASAP.

It’s more than reasonable to request this as a feature.

Hmmm… I was doing this maybe 6-7 months ago through some sort of weird workaround that didn’t use cut… Maybe I made my own iterator? I remember it not being too hard, but from grokking the latest code, it looks like things are different from what I remember… Maybe take a crack at making one? :smiley:

https://github.com/JuliaData/DataFrames.jl/blob/025824f80e720693b7e21ac49d8ae64c8830ce98/src/groupeddataframe/grouping.jl#L165