Groupby generic table-shaped things

I know of Query.jl but frankly I hate both its syntaxes [1, 2]. (And also weirdly doesn’t depend on Tables.jl?)

In the Tables.jl universe, I know of TableOperations.jl and TableTransforms.jl but weirdly neither of them supports groupby? And why are there two of them?

DataFrames.jl has the best syntax by far in my opinion. It’s as simple as I can imagine a groupby being, for example

gdf = groupby(df, :key)
fdf = combine(gdf, :colname => sum => :colname_sum)

but it’s not generic, it’s specific to DataFrames.jl

So how do I manipulate table-shaped things in a generic manner? My requirements are light: is just needs to be generic and support a groupby.

[1] Standalone Query Commands · Query.jl
[2] LINQ Style Query Commands · Query.jl

2 Likes

Thank you for mentioning, but one immediate comment is that despite its name, it does not have an apply or a combine methods… I could do this myself “by hand”, but then I wouldn’t be asking for a package :slight_smile: As an example,

y = SplitApplyCombine.group(x, :key)

returns a Dictionary where the keys are the values of the keys key, and the values of vectors of elements of x. How do I process this downstream? Sure, I guess the apply would be equivalent to

apply(func, y) = map(f, values(y))

But compare how much cleaner my example from DataFrame.jl is.

I guess in my mind I’m imagining something which has exactly the DataFrame.jl syntax, but can act on generic Tables.jl :slight_smile:

Just a few days ago, I gave a JuliaCon talk on lightweight tables and generic operations on them – including group :slight_smile:
https://www.youtube.com/live/G_DLBmO1EGM?si=1KmauidWe1gwad27&t=15131 (with timecode)

From your question, not sure what exactly do you miss in the group syntax? It nicely follows Base Julia approach of functions like filter or map, and quite clean in practice. DataManipulation.jl package has a more performant and generic group implementation that SplitApplyCombine, but the basic interface is effectively the same:
map(gr -> ... operation ..., group(x -> ... key ..., data)).
As common with data manipulation in Julia, such a pipeline becomes even more readable with DataPipes syntax – e.g., to count elements per group:
@p data group(_.key) map(length).

Note: this works with basically any Julia collection, which most in-memory tables are. Unfortunately, DataFrames don’t follow the collection interface.

Thank you for that. I will listen to your talk right now, thank you for sharing.

Could you clarify, what is the difference between DataManipulation.jl and SplitApplyCombine.jl? Skimming they look very similar, even down to the use of Dictionaries.jl, no? - EDIT already mentioned in the talk thank you: SplitApplyCombine.jl is no longer developed and you wanted to implement some things differently.

Regarding syntax - I don’t know if it’s just me, but I can never ever write those queries without checking the docs first. Lets look at your example:

@p data group(_.key) map(length)

I try to avoid macros with unusual syntaxes. I can guess that _.key is something like x -> getproperty(x, :key). Ok. But what if I wanted x -> (getproperty(x, :key1), getproperty(x, :key2)), how would that fit in the @p macro? I’m guessing there’s another macro-specific syntax for it…

I wonder if you would show us how my example would be written within DataManipulation.jl. Here it is again:

gdf = groupby(df, :key)
fdf = combine(gdf, :colname => sum => :colname_sum)

For concreteness lets suppose we start with

xx = [
    (; a=true, b=1.),
    (; a=true, b=2.),
    (; a=false, b=3.),
    (; a=false, b=4.),
]

The group would be

gx = group(x->getproperty(x, :a), xx)

Then how would I do the sum? I tried

sx = map(sum, values(gx))

This doesn’t work, because it tries to apply the sum to the elements of the elements of gx which are NamedTuples, when in reality I want to sum over the b “column”.

And at the end you need at least to be able to rename b into b_sum and skimming I don’t see how to do renames in DataManipulation.jl

EDIT2: I just heard your talk, thank you again. What a shame that we can’t hear the discussion at the end, because that question/comment was also something I thought about.

“Unusual” is in the eye beholder :slight_smile: The @p macro exists just to make data manipulation pipelines more readable (including Base functions), everything can be written without it as well.

It’s even easier than you guessed! _.key is exactly the same as x -> x.key, so (_.key1, _.key2) is the same as x -> (x.key1, x.key2).

For columnar operations, it’s typically most convenient to use StructArrays. Let’s start with just grouping:

julia> tbl = StructArray([
           (; a=true, b=1.),
           (; a=true, b=2.),
           (; a=false, b=3.),
           (; a=false, b=4.),
       ])
4-element StructArray(::Vector{Bool}, ::Vector{Float64}) with eltype @NamedTuple{a::Bool, b::Float64}:
 (a = 1, b = 1.0)
 (a = 1, b = 2.0)
 (a = 0, b = 3.0)
 (a = 0, b = 4.0)

julia> grs = group(x -> x.a, tbl)
2-element Dictionaries.ArrayDictionary{Bool, StructVector{@NamedTuple{a::Bool, b::Float64}, @NamedTuple{a::SubArray{Bool, 1, Vector{Bool}, Tuple{UnitRange{Int64}}, true}, b::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}}, Int64}, Dictionaries.ArrayIndices{Bool, Vector{Bool}}, Vector{StructVector{@NamedTuple{a::Bool, b::Float64}, @NamedTuple{a::SubArray{Bool, 1, Vector{Bool}, Tuple{UnitRange{Int64}}, true}, b::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}}, Int64}}}:
  true │ @NamedTuple{a::Bool, b::Float64}[(a = 1, b = 1.0), (a = 1, b = 2.0)]
 false │ @NamedTuple{a::Bool, b::Float64}[(a = 0, b = 3.0), (a = 0, b = 4.0)]

Now, we have a collection of groups, and want some per-group operation. Within Julia Base semantics, it’s just map.
Note: we only needed DataManipulation.jl for group and won’t use it further in this example.

Given that each group is a StructArray, we can easily apply columnwise operations:

julia> map(gr -> sum(gr.b), grs)
2-element Dictionaries.ArrayDictionary{Bool, Float64, Dictionaries.ArrayIndices{Bool, Vector{Bool}}, Vector{Float64}}:
  true │ 3.0
 false │ 7.0

If you want to have a table back, it tends to be more convenient to work with arrays of groups instead of dictionaries:

julia> grs = group_vg(x -> (;x.a), tbl)  # "vg" for Vector of Groups
2-element StructArray(::Vector{@NamedTuple{a::Bool}}, ::Vector{StructVector{@NamedTuple{a::Bool, b::Float64}, @NamedTuple{a::SubArray{Bool, 1, Vector{Bool}, Tuple{UnitRange{Int64}}, true}, b::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}}, Int64}}) with eltype FlexiGroups.GroupArray{@NamedTuple{a::Bool}, @NamedTuple{a::Bool, b::Float64}, 1, StructVector{@NamedTuple{a::Bool, b::Float64}, @NamedTuple{a::SubArray{Bool, 1, Vector{Bool}, Tuple{UnitRange{Int64}}, true}, b::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}}, Int64}}:
 [(a = 1, b = 1.0), (a = 1, b = 2.0)]
 [(a = 0, b = 3.0), (a = 0, b = 4.0)]

julia> map(gr -> (;key(gr)..., b_sum=sum(gr.b)), grs)
2-element StructArray(::Vector{Bool}, ::Vector{Float64}) with eltype @NamedTuple{a::Bool, b_sum::Float64}:
 (a = 1, b_sum = 3.0)
 (a = 0, b_sum = 7.0)

As you see, “renaming” is done with just base Julia map. No extra functions needed!