Joining GroupedDataFrames

I want to make a struct containing small dataframes that can be efficiently merged into larger dataframes. Are there any out-of-the-box solutions to cache the matching effort required for the joins?

I tried to do this using GroupedDataFrame (as it stores the row numbers for each group of the grouping columns, which should give a speed boost). But I cannot seem to do joins on these structs:

using DataFrames

# Example data
rows = 1000000
stock_day = DataFrame(stock = repeat([:A, :b], Int(rows/2)), day = sort(repeat(collect(1:Int(rows/2)), 2)), price = rand(rows))
day = DataFrame(day = collect(1:Int(rows/2)), temperature = rand(Int(rows/2)))
stock = DataFrame(stock = [:A, :b], country = [:AUS, :USA])
country = DataFrame(country = [:AUS, :USA], GDP_per_capita = [5, 6])

struct SplitData2
    frames::Dict{Symbol,GroupedDataFrame}
end
sd2 = SplitData2(Dict{Symbol,GroupedDataFrame}(
    [:stock_day, :day, :stock, :country] .=> [groupby(stock_day, Symbol[:stock, :day]),
                                              groupby(day, Symbol[:day]),
                                              groupby(stock, Symbol[:stock]),
                                              groupby(country, Symbol[:country])]))

function stock_day_gdp(sd::SplitData2)
    aa = leftjoin(sd.frames[:stock_day], sd.frames[:day], on = :day)
    bb = leftjoin(aa, sd.frames[:stock], on = [:stock])
    return bb
end
@time sg = stock_day_gdp(sd2)

Here I get ERROR: MethodError: no method matching leftjoin(::GroupedDataFrame{DataFrame}, ::GroupedDataFrame{DataFrame}; on=:day)

Is there a way to efficiently merge GroupedDataFrames? If not, is there a way to maintain a set of dataframes with preselected merge columns (with the merge matching being precomputed) that can be merged efficiently?

Currently there is not. This could potentially be added in the future.

with the merge matching being precomputed

This is a crucial challenge. Merge matching must be done considering both sides of the join. Consider e.g. the following data frames:

julia> df1 = DataFrame(id=1:3)
3×1 DataFrame
 Row │ id
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df2 = DataFrame(id=11:13)
3×1 DataFrame
 Row │ id
     │ Int64
─────┼───────
   1 │    11
   2 │    12
   3 │    13

even if you run groupby on them, it would give no benefit when performing a join, since no keys match across the two tables.

(though in some cases indeed there would be some benefit)
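To make the point concrete, here is a quick check using the same df1/df2 as above: each side can be grouped in advance, but the join still has to compare keys across the two tables to discover that nothing matches.

```julia
using DataFrames

df1 = DataFrame(id = 1:3)
df2 = DataFrame(id = 11:13)

# Grouping precomputes per-table row indices for each key...
g1 = groupby(df1, :id)
g2 = groupby(df2, :id)

# ...but the cross-table matching still has to be done at join time:
innerjoin(df1, df2, on = :id)  # empty result: no keys are shared
```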

I don’t have a clear picture of what your need is.
But by reproducing the structure you used and substituting DataFrames for the GroupedDataFrames, the leftjoin function becomes applicable.
Whether that’s what you actually need (efficiency-wise and otherwise) is another matter entirely :grinning:

struct SplitData3
    frames::Dict{Symbol,DataFrame}
end
sd3 = SplitData3(Dict{Symbol,DataFrame}(
    [:stock_day, :day, :stock, :country] .=> [combine(groupby(stock_day, [:stock, :day]), Ref),
                                              combine(groupby(day, :day), Ref),
                                              combine(groupby(stock, :stock), Ref),
                                              combine(groupby(country, :country), Ref)]))

function stock_day_gdp(sd::SplitData3)
    aa = leftjoin(sd.frames[:stock_day], sd.frames[:day], on = :day, makeunique = true)
    bb = leftjoin(aa, sd.frames[:stock], on = [:stock], makeunique = true)
    return bb
end

@time sg = stock_day_gdp(sd3)

Sometimes it seems better to let database software do its thing. I think this is one of those times.
Specifically, here is a link to a blog post by @bkamins
which I haven’t read yet but looks promising:
https://bkamins.github.io/julialang/2022/12/23/duckdb.html

Ah, fair enough. I wonder if making the keymap an ordered dict (ordered according to the group levels) might help. There would still be some computation in matching it against whatever the other dataframe is, but it seems you would only have to pass through each of the dicts once to do it. So it might beat merging the parent dataframes without the precomputed indices.
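A rough sketch of that single-pass idea (a hypothetical helper, not part of DataFrames.jl): if each side's group keys are kept sorted, together with the row indices of each group in its parent frame, one merge-style pass over both key lists finds all matching groups.

```julia
# Sketch only: sorted merge over precomputed group keys.
# `keys1`/`keys2` are the sorted group keys of each side; `rows1`/`rows2` give
# the row indices of each group in its parent data frame.
function merge_groups(keys1::Vector, rows1::Vector{Vector{Int}},
                      keys2::Vector, rows2::Vector{Vector{Int}})
    matches = Tuple{Vector{Int},Vector{Int}}[]
    i = j = 1
    while i <= length(keys1) && j <= length(keys2)
        if keys1[i] == keys2[j]
            push!(matches, (rows1[i], rows2[j]))  # these groups join with each other
            i += 1; j += 1
        elseif keys1[i] < keys2[j]
            i += 1
        else
            j += 1
        end
    end
    return matches
end
```

This is essentially a sort-merge join over the group keys; the dict-building (hashing) cost is paid once at grouping time rather than on every merge.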

Thanks @rocco_sprmnt21 but on my machine that is slower than just merging the ungrouped dataframes.

Basically what I had in mind here was that for analysis I could merge everything together at the start, but then 1) it takes a lot of RAM and 2) some computations are less efficient than computing on the fewer (unmerged) rows of data. Or I could leave them separate, which would improve on both points, but in a large program I might merge them together a few times in different places. Using some sort of precomputed merge information seemed like it could be the best of both worlds.
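One way to approximate that "best of both worlds" outside the library (a sketch under assumptions, not an existing DataFrames.jl feature; the names are illustrative): build a key => row-indices dict for each small frame once, then reuse it every time you need the matching rows. The `attach` helper below assumes each key occurs exactly once in the indexed frame and covers all keys on the left, as in the `day`/`stock`/`country` tables above; a real leftjoin also handles duplicates and misses.

```julia
using DataFrames

# Sketch: cache a key => row-indices lookup per small frame.
struct IndexedFrame
    df::DataFrame
    key::Symbol
    index::Dict{Any,Vector{Int}}  # built once, reused for every merge
end

function IndexedFrame(df::DataFrame, key::Symbol)
    index = Dict{Any,Vector{Int}}()
    for (i, k) in enumerate(df[!, key])
        push!(get!(index, k, Int[]), i)
    end
    return IndexedFrame(df, key, index)
end

# Pull the matching rows of `right` for each key in `left`,
# without re-hashing `right` on every merge.
function attach(left::DataFrame, right::IndexedFrame)
    rowidx = reduce(vcat, (right.index[k] for k in left[!, right.key]))
    return hcat(left, select(right.df[rowidx, :], Not(right.key)))
end
```

Whether this beats letting `leftjoin` hash the small table each time depends on sizes; hashing a small lookup table is usually cheap, so the win would mostly come when the same merge is repeated many times.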

@Dan thanks, but I think I would need to do some manipulation of the data within Julia for the program I had in mind. I guess SQL is Turing complete, so I could just write an epic Postgres query, but I don’t think I am enough of an SQL guru for that.

Also try with this data structure… (the idea is the same: somehow transform the groups into something digestible by the leftjoin function)


struct SplitData1
    frames::Dict{Symbol,DataFrame}
end

sd1 = SplitData1(Dict{Symbol,DataFrame}(
    [:stock_day, :day, :stock, :country] .=> [DataFrame(groupby(stock_day, Symbol[:stock, :day])),
                                              DataFrame(groupby(day, Symbol[:day])),
                                              DataFrame(groupby(stock, Symbol[:stock])),
                                              DataFrame(groupby(country, Symbol[:country]))]))

function stock_day_gdp1(sd::SplitData1)
    aa = leftjoin(sd.frames[:stock_day], sd.frames[:day], on = :day)
    bb = leftjoin(aa, sd.frames[:stock], on = [:stock])
    return bb
end
julia> @time sg = stock_day_gdp1(sd1)
  0.082030 seconds (676 allocations: 112.568 MiB)
1000000×5 DataFrame
     Row │ stock   day     price      temperature  country 
         │ Symbol  Int64   Float64    Float64?     Symbol?
─────────┼─────────────────────────────────────────────────
       1 │ A            1  0.609843     0.572564   AUS
       2 │ b            1  0.570739     0.572564   USA
       3 │ A            2  0.803925     0.882174   AUS


PS
But are your groups in the real case only one row each, like in the given example?
Could you give an example with structure and dimensions similar to your actual case, and explain better what kind of operations you have to do most often?

Such a dict is not created at all, as it would be inefficient. Also, in joins we do not create a dict if we do not have to, nor do we hash the columns on which the data is merged (though sometimes it is inevitable).

In general, you are right that if we had a few large groups in both data frames, it would be more efficient to use the information from groupby to perform the join; I just gave an example where this is not beneficial. As I commented in my first post, join support for GroupedDataFrame could be added; it is just not done yet and it is not trivial (so other functionality took priority).