Hi! I’m writing to check my understanding of a computational problem and see if I am approaching things correctly.
I am writing estimation code that needs to traverse data in several different ways. The typical approach to these problems in my field is to follow the split-apply-combine model and combine multiple dataframes in different ways to arrive at a likelihood (e.g. pyBLP as a Python example and a newer Julia package here). The technical details are irrelevant; the point is that this approach is computationally infeasible for my particular setup.
What has worked for me is to collect my data into a sparse array and combine it with a set of lookups, essentially a set of dictionaries mapping values to Cartesian indices. (For example, I have a “time” dimension, so a time lookup maps a particular date to all of the nonzero indices in my sparse array that correspond to that date.) I can mimic the split-apply-combine operations I need using sparse array operations and tensor operations from TensorOperations.jl, and my code is several orders of magnitude faster than with DataFrames.jl.
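To make the lookup idea concrete, here is a stripped-down sketch (the dimension sizes and names are made up, and a dense random array stands in for my actual sparse data; the real code layers sparse storage and TensorOperations.jl on top of this):
using Dates

dates = Date(2020, 1, 1):Month(1):Date(2020, 12, 1)
X = rand(10, 5, length(dates))            # stand-in for a (product, market, time) data array
nz = findall(!iszero, X)                  # all nonzero Cartesian indices

# time lookup: date => every nonzero index whose time coordinate is that date
time_lookup = Dict(d => CartesianIndex{3}[] for d in dates)
for I in nz
    push!(time_lookup[dates[I[3]]], I)
end

# a "group by date, then sum" analogue done purely with the lookup
totals = Dict(d => sum(X[I] for I in inds) for (d, inds) in time_lookup)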
I am aware that this is isomorphic to indexing in the database sense of the word. By storing all of my data in a SQL database and creating B-tree indexes where I need them, I can achieve the same performance as the (perhaps overengineered) solution I came up with above. But then I would have to manage a SQL database!
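Concretely, the database analogue I have in mind is just something like this via SQLite.jl, with made-up table and column names:
using SQLite, DBInterface

db = SQLite.DB("estimation.db")
DBInterface.execute(db, "CREATE INDEX IF NOT EXISTS idx_time ON observations (date)")
DBInterface.execute(db, "CREATE INDEX IF NOT EXISTS idx_market_product ON observations (market, product)")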
My question is: is there a feature of DataFrames.jl that I am missing or not understanding correctly in evaluating the performance tradeoffs here? Essentially, I want to replicate the functionality of grouped dataframes, but be able to pre-compute and store multiple groupings. Maybe with some nice syntactic sugar for handling iteration. Is it crazy to imagine something like this?
struct DataFrameIndex
    parent::DataFrame
    groupings::Dict{Vector{Symbol}, GroupedDataFrame}
end

# constructors...
df = ...
dfindex = create_index(df)
addindex!(dfindex, :foo)
addindex!(dfindex, [:bar, :foo])

function DataFrames.groupby(dfi::DataFrameIndex, cols)
    key = cols isa Symbol ? [cols] : collect(Symbol, cols)  # normalize to Vector{Symbol}
    if haskey(dfi.groupings, key)
        dfi.groupings[key]            # reuse the cached grouping
    else
        groupby(dfi.parent, cols)     # fall back to grouping on the fly
    end
end
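To be concrete, here is roughly how I imagine the elided pieces (create_index and addindex! are my own hypothetical helpers, not anything that exists in DataFrames.jl):
using DataFrames

create_index(df::DataFrame) = DataFrameIndex(df, Dict{Vector{Symbol}, GroupedDataFrame}())

function addindex!(dfi::DataFrameIndex, cols)
    key = cols isa Symbol ? [cols] : collect(Symbol, cols)   # same normalization as above
    dfi.groupings[key] = groupby(dfi.parent, key)            # pre-compute and cache the grouping
    return dfi
end

df = DataFrame(foo = repeat(1:3, 4), bar = repeat(1:2, 6), x = rand(12))
dfi = create_index(df)
addindex!(dfi, :foo)
groupby(dfi, :foo)    # returns the cached GroupedDataFrame instead of regrouping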
I don’t quite understand the laziness of group index computation in GroupedDataFrame, but I imagine there are some computational savings from reusing the same grouped dataframe instead of creating another one. I think there could be savings when you see something like
subset(dfindex, in => f)
transform(dfindex, in => f => out)
where in is a pre-computed grouping and f is an instance of ByRow. You could compute f only on the keys of the GroupedDataFrame’s keymap field and broadcast the results back to the rows, which is potentially a great deal less computation.
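As a toy illustration of the kind of saving I mean (expensive_f and the data here are made up):
using DataFrames

expensive_f(x) = (sleep(0.001); x^2)                     # stand-in for a costly per-value function

df = DataFrame(foo = repeat(1:3, 1000), x = rand(3000))
gd = groupby(df, :foo)                                   # in my setup this would come from the cached index

# evaluate once per group key instead of once per row, then fan the result back out
vals = Dict(k.foo => expensive_f(k.foo) for k in keys(gd))
transform(df, :foo => ByRow(v -> vals[v]) => :out)       # 3 evaluations instead of 3000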
To be clear, I am not asking for this functionality to be implemented! I understand the lift involved. I am just asking whether I am crazy for hand-rolling a solution like this whenever there’s a computational need, or whether there is something that already exists that I am missing.
Thanks!