Pre-computing Groupings for DataFrames?

Hi! I’m writing to check my understanding of a computational problem and see if I am approaching things correctly.

I am writing estimation code that needs to traverse data in multiple different ways. The typical approach to these problems in my field is to follow the split-apply-combine model and combine multiple dataframes in different ways to arrive at a likelihood (e.g., pyBLP in Python, and there is a newer Julia package as well). The technical details are irrelevant, but this approach is computationally infeasible for my particular setup.

What has worked for me is to collect my data into a sparse array and combine it with a set of lookups, essentially a set of dictionaries mapping values to Cartesian indices. (E.g., I have a “time” dimension, so a time lookup would map a particular date to all of the nonzero indices in my sparse array that correspond to that date.) I can mimic the split-apply-combine operations that I need using sparse array operations and tensor operations from TensorOperations.jl, and my code is several orders of magnitude faster than with DataFrames.jl.
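For concreteness, here is a minimal sketch of the lookup idea, assuming a two-dimensional sparse matrix whose columns are time periods (all names here are made up):

using SparseArrays

# Hypothetical layout: rows are units, columns are time periods
A = sprand(100, 12, 0.3)

# A "time lookup": map each period to the nonzero Cartesian indices
# in that period, so later passes can jump straight to them
rows, cols, _ = findnz(A)
time_lookup = Dict{Int, Vector{CartesianIndex{2}}}()
for (i, j) in zip(rows, cols)
    push!(get!(time_lookup, j, CartesianIndex{2}[]), CartesianIndex(i, j))
end

# A split-apply-combine stand-in: sum the nonzeros in period 3
# without scanning the whole array
s = sum(A[idx] for idx in time_lookup[3])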

I am aware that this is isomorphic to indexing in the database sense of the word. By storing all of my data in a SQL database and creating B-tree indexes where I need them, I can achieve the same performance as the (perhaps overengineered) solution I came up with above. But that would involve managing a SQL database!
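For anyone curious, the SQL route looks roughly like this, assuming SQLite.jl for illustration (the table and column names are made up):

using SQLite, DataFrames
import DBInterface

db = SQLite.DB()  # in-memory database
SQLite.load!(DataFrame(time = [1, 1, 2], y = [0.1, 0.2, 0.3]), db, "obs")

# SQLite indexes are B-trees, so this is the database analogue of the lookups above
DBInterface.execute(db, "CREATE INDEX time_idx ON obs (time)")
DBInterface.execute(db, "SELECT * FROM obs WHERE time = 1") |> DataFrame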

My question is: is there a feature of DataFrames.jl that I am missing or not understanding correctly in evaluating the performance tradeoffs here? Essentially, I want to replicate the functionality of grouped dataframes, but be able to pre-compute and store multiple groupings. Maybe with some nice syntactic sugar for handling iteration. Is it crazy to imagine something like this?

using DataFrames

struct DataFrameIndex
    parent::DataFrame
    groupings::Dict{Vector{Symbol}, GroupedDataFrame}
end

# constructors...
create_index(df::DataFrame) =
    DataFrameIndex(df, Dict{Vector{Symbol}, GroupedDataFrame}())

# Normalize `cols` to a Vector{Symbol} so `:foo` and `[:foo]` share a key
addindex!(dfi::DataFrameIndex, cols) =
    dfi.groupings[vcat(cols)] = groupby(dfi.parent, vcat(cols))

df = ...
dfindex = create_index(df)
addindex!(dfindex, :foo)
addindex!(dfindex, [:bar, :foo])

# Return the pre-computed grouping if one exists, else group on the fly
function DataFrames.groupby(dfi::DataFrameIndex, cols)
    if haskey(dfi.groupings, vcat(cols))
        dfi.groupings[vcat(cols)]
    else
        groupby(dfi.parent, cols)
    end
end

I don’t quite understand the laziness of group index computation in GroupedDataFrame, but I imagine there are some computational savings from reusing the same grouped dataframe instead of creating another. I think there could be savings when you see something like

subset(dfindex, in => f)
transform(dfindex, in => f => out)

where in is a pre-computed grouping and f is an instance of ByRow. You could compute f just on the keys of the GroupedDataFrame’s keymap field, which is potentially a great deal less computation.
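To illustrate, here is a rough sketch using the public keys accessor rather than the internal keymap field (the column names are made up):

using DataFrames

df = DataFrame(g = repeat(1:3, inner = 4), x = 1:12)
gdf = groupby(df, :g)

# Evaluate f once per group key instead of once per row
f(g) = g^2
vals = Dict(k.g => f(k.g) for k in keys(gdf))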

To be clear, I am not asking for this functionality to be implemented! I understand the lift involved. I am just asking whether I am crazy for hand-rolling a solution like this whenever there’s a computational need, or whether there is something that already exists that I am missing.

Thanks!


The fact that groupby is lazy probably does not help in your case. What it does is avoid building a Dict when it is not needed later (most notably, when later aggregations do not require the group index, which is the case for common operations like sum). In your case, if you need multiple groupbys on different sets of columns, the approach you describe is currently the only one that makes sense. We thought of allowing groupby(df, [:a, :b]) to reuse the results of groupby(df, :a) if they were already available (i.e. doing a more fine-grained groupby), but this functionality has never been worked on.


You might check DuckDB.jl. I understand that DuckDB supports multiple indexes per table.

They do! But it doesn’t work with their “dataframe scan”, and I imagine not with their Parquet file support for the same reason: creating an index requires a base table. So now I have to manage a SQL database again… at least this one is a flat file. There are many tools for the job 😀

using DuckDB, DataFrames
import DBInterface as DBI

df = DataFrame(a = 1:3)
con = DBI.connect(DuckDB.DB)
DuckDB.register_data_frame(con, df, "df")
DBI.execute(con, "create index foo on df (a)")
ERROR: Binder Error: can only create an index on a base table
Stacktrace:
 [1] DuckDB.Stmt(con::DuckDB.Connection, sql::String, result_type::Type)
   @ DuckDB ~/.julia/packages/DuckDB/AHUoM/src/statement.jl:18
 [2] prepare
   @ ~/.julia/packages/DuckDB/AHUoM/src/result.jl:853 [inlined]
 [3] prepare
   @ ~/.julia/packages/DuckDB/AHUoM/src/result.jl:854 [inlined]
 [4] prepare
   @ ~/.julia/packages/DuckDB/AHUoM/src/result.jl:855 [inlined]
 [5] execute
   @ ~/.julia/packages/DBInterface/nQcsk/src/DBInterface.jl:130 [inlined]
 [6] execute(conn::DuckDB.DB, sql::String)
   @ DBInterface ~/.julia/packages/DBInterface/nQcsk/src/DBInterface.jl:152
 [7] top-level scope
   @ REPL[6]:1

Thanks for the feedback! I imagine that when you reuse the same GroupedDataFrame for multiple combine operations, though, there is a speed-up, right? The grouping gets computed for the first operation and subsequent operations can take advantage of it (or, when you’re in the REPL, it seems like you can just print the GroupedDataFrame to force grouping 😀). Admittedly it’s a more niche use case, but I do find myself passing around a grouped dataframe occasionally for this reason.

This does not trigger grouping? Or are you talking about sum in a different context?

@> df groupby(:a) combine(:b => sum)

I’m marking Bogumil’s answer as a solution since I don’t think there’s any necessary follow-up here. I do want to note for posterity that there is also ndsparse from the IndexedTables package, which could come in handy in situations like this. A use case with such specific lookup/grouping needs is probably niche enough that asking the user to roll their own solution is reasonable.
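For example, a minimal ndsparse sketch (the column names are made up):

using IndexedTables

# ndsparse keys each row by the index columns, so lookups by
# (time, id) are fast without hand-rolling the dictionaries
t = ndsparse((time = [1, 1, 2], id = [10, 20, 10]),
             (y = [0.1, 0.2, 0.3],))
t[1, 20]  # look up the row with time = 1, id = 20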

Printing triggers grouping, because to print the groups you need to compute them.

combine(:b => sum) does not trigger grouping, as computing a sum by group does not require finding the groups.

Once groups are computed, they are materialized and then re-used (i.e. they are not recomputed).

You can verify this by checking the in-memory size of the GroupedDataFrame: after the groups are computed, it grows once and then stays constant.
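For instance, a rough sketch of that check (the exact numbers will vary):

using DataFrames

df = DataFrame(a = rand(1:100, 10^6), b = rand(10^6))
gdf = groupby(df, :a)

Base.summarysize(gdf)    # before the group indices are materialized
combine(gdf, :b => sum)  # fast aggregation path; size should not change
Base.summarysize(gdf)
gdf[1]                   # indexing into the groups forces materialization
Base.summarysize(gdf)    # larger now, and constant from here on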