Joining GroupedDataFrames

I want to make a struct containing small dataframes that can be efficiently merged into larger dataframes. Are there any out-of-the-box solutions to cache the matching effort required for the joins?

I tried to do this using GroupedDataFrame (as it stores the row numbers for each group of the grouping columns, which should give a speed boost). But I cannot seem to do joins on these structs:

using DataFrames

# Example data
rows = 1000000
stock_day = DataFrame(stock = repeat([:A, :b], Int(rows/2)), day = sort(repeat(collect(1:Int(rows/2)), 2)), price = rand(rows))
day = DataFrame(day = collect(1:Int(rows/2)), temperature = rand(Int(rows/2)))
stock = DataFrame(stock = [:A, :b], country = [:AUS, :USA])
country = DataFrame(country = [:AUS, :USA], GDP_per_capita = [5, 6])

struct SplitData2
    frames::Dict{Symbol,GroupedDataFrame}
end
sd2 = SplitData2(Dict{Symbol,GroupedDataFrame}(
    [:stock_day, :day, :stock, :country] .=> [groupby(stock_day, Symbol[:stock, :day]),
                                              groupby(day, Symbol[:day]),
                                              groupby(stock, Symbol[:stock]),
                                              groupby(country, Symbol[:country])]))

function stock_day_gdp(sd::SplitData2)
    aa = leftjoin(sd.frames[:stock_day], sd.frames[:day], on = :day)
    bb = leftjoin(aa, sd.frames[:stock], on = [:stock])
    return bb
end
@time sg = stock_day_gdp(sd2)

Here I get ERROR: MethodError: no method matching leftjoin(::GroupedDataFrame{DataFrame}, ::GroupedDataFrame{DataFrame}; on=:day)

Is there a way to efficiently merge GroupedDataFrames? If not, is there a way to maintain a set of dataframes with preselected merge columns (with the merge matching being precomputed) that can be merged efficiently?

Currently there is not. This could potentially be added in the future.

with the merge matching being precomputed

This is a crucial challenge. Merge matching must be done considering both sides of the join. Consider e.g. the following data frames:

julia> df1 = DataFrame(id=1:3)
3×1 DataFrame
 Row │ id
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df2 = DataFrame(id=11:13)
3×1 DataFrame
 Row │ id
     │ Int64
─────┼───────
   1 │    11
   2 │    12
   3 │    13

even if you run groupby on them, it would give no benefit when performing a join, since no keys match across the two tables.

(though in some cases indeed there would be some benefit)
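To make the point concrete, here is a quick check using the same df1/df2 as above: each side can be grouped in advance, but the join still has to compare keys across the two tables to discover that nothing matches.

```julia
using DataFrames

df1 = DataFrame(id = 1:3)
df2 = DataFrame(id = 11:13)

# Grouping precomputes per-table row indices for each key...
g1 = groupby(df1, :id)
g2 = groupby(df2, :id)

# ...but the cross-table matching still has to be done at join time:
innerjoin(df1, df2, on = :id)  # empty result: no keys are shared
```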

I don’t have a clear picture of what your need is.
But by reproducing the structure you used and substituting DataFrames for the GroupedDataFrames, the leftjoin function becomes applicable.
Whether that’s what you actually need (efficiency-wise and otherwise) is another matter entirely :grinning:

struct SplitData3
    frames::Dict{Symbol,DataFrame}
end
sd3 = SplitData3(Dict{Symbol,DataFrame}(
    [:stock_day, :day, :stock, :country] .=> [combine(groupby(stock_day, [:stock, :day]), Ref),
                                              combine(groupby(day, :day), Ref),
                                              combine(groupby(stock, :stock), Ref),
                                              combine(groupby(country, :country), Ref)]))

function stock_day_gdp(sd::SplitData3)
    aa = leftjoin(sd.frames[:stock_day], sd.frames[:day], on = :day, makeunique = true)
    bb = leftjoin(aa, sd.frames[:stock], on = [:stock], makeunique = true)
    return bb
end

@time sg = stock_day_gdp(sd3)

Sometimes it seems better to let database software do its thing. I think this is one of those times.
Specifically, here is a link to a blog post by @bkamins
which I haven’t read yet but looks promising:
https://bkamins.github.io/julialang/2022/12/23/duckdb.html

Ah, fair enough. I wonder if making the keymap an ordered dict (ordered according to the group levels) might help. There would still be some computation in matching it against whatever the other dataframe is, but it seems you would only have to pass through each of the dicts once to do it. So it might beat merging the parent dataframes without the precomputed indices.
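A rough sketch of that single-pass idea (a hypothetical helper, not part of DataFrames.jl): if each side's group keys are kept sorted, together with the row indices of each group in its parent frame, one merge-style pass over both key lists finds all matching groups.

```julia
# Sketch only: sorted merge over precomputed group keys.
# `keys1`/`keys2` are the sorted group keys of each side; `rows1`/`rows2` give
# the row indices of each group in its parent data frame.
function merge_groups(keys1::Vector, rows1::Vector{Vector{Int}},
                      keys2::Vector, rows2::Vector{Vector{Int}})
    matches = Tuple{Vector{Int},Vector{Int}}[]
    i = j = 1
    while i <= length(keys1) && j <= length(keys2)
        if keys1[i] == keys2[j]
            push!(matches, (rows1[i], rows2[j]))  # these groups join with each other
            i += 1; j += 1
        elseif keys1[i] < keys2[j]
            i += 1
        else
            j += 1
        end
    end
    return matches
end
```

This is essentially a sort-merge join over the group keys; the dict-building (hashing) cost is paid once at grouping time rather than on every merge.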

Thanks @rocco_sprmnt21 but on my machine that is slower than just merging the ungrouped dataframes.

Basically what I had in mind here was that for analysis I could merge everything together at the start, but then 1) it takes a lot of RAM and 2) some computations are less efficient than computing on the fewer (unmerged) rows of data. Or I could leave them separate, which would improve on both points, but in a large program I might merge them together a few times in different places. Using some sort of precomputed merge information seemed like it could be the best of both worlds.
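One way to approximate that "best of both worlds" outside the library (a sketch under assumptions, not an existing DataFrames.jl feature; the names are illustrative): build a key => row-indices dict for each small frame once, then reuse it every time you need the matching rows. The `attach` helper below assumes each key occurs exactly once in the indexed frame and covers all keys on the left, as in the `day`/`stock`/`country` tables above; a real leftjoin also handles duplicates and misses.

```julia
using DataFrames

# Sketch: cache a key => row-indices lookup per small frame.
struct IndexedFrame
    df::DataFrame
    key::Symbol
    index::Dict{Any,Vector{Int}}  # built once, reused for every merge
end

function IndexedFrame(df::DataFrame, key::Symbol)
    index = Dict{Any,Vector{Int}}()
    for (i, k) in enumerate(df[!, key])
        push!(get!(index, k, Int[]), i)
    end
    return IndexedFrame(df, key, index)
end

# Pull the matching rows of `right` for each key in `left`,
# without re-hashing `right` on every merge.
function attach(left::DataFrame, right::IndexedFrame)
    rowidx = reduce(vcat, (right.index[k] for k in left[!, right.key]))
    return hcat(left, select(right.df[rowidx, :], Not(right.key)))
end
```

Whether this beats letting `leftjoin` hash the small table each time depends on sizes; hashing a small lookup table is usually cheap, so the win would mostly come when the same merge is repeated many times.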

@Dan thanks, but I think I would need to do some manipulation of the data within Julia for the program I had in mind. I guess SQL is Turing complete, so I could just write an epic Postgres query, but I don’t think I am enough of an SQL guru for that.

Also try with this data structure… (the idea is the same: somehow transform the groups into something digestible by the leftjoin function)


struct SplitData1
    frames::Dict{Symbol,DataFrame}
end

sd1 = SplitData1(Dict{Symbol,DataFrame}(
    [:stock_day, :day, :stock, :country] .=> [DataFrame(groupby(stock_day, Symbol[:stock, :day])),
                                              DataFrame(groupby(day, Symbol[:day])),
                                              DataFrame(groupby(stock, Symbol[:stock])),
                                              DataFrame(groupby(country, Symbol[:country]))]))

function stock_day_gdp1(sd::SplitData1)
    aa = leftjoin(sd.frames[:stock_day], sd.frames[:day], on = :day)
    bb = leftjoin(aa, sd.frames[:stock], on = [:stock])
    return bb
end
julia> @time sg = stock_day_gdp1(sd1)
  0.082030 seconds (676 allocations: 112.568 MiB)
1000000×5 DataFrame
     Row │ stock   day     price      temperature  country 
         │ Symbol  Int64   Float64    Float64?     Symbol?
─────────┼─────────────────────────────────────────────────
       1 │ A            1  0.609843     0.572564   AUS
       2 │ b            1  0.570739     0.572564   USA
       3 │ A            2  0.803925     0.882174   AUS


PS
But are your groups in the real case only one row each, like in the given example?
Could you give an example with structure and dimensions similar to your actual case, and explain better what kind of operations you have to do most often?

Such a dict is not created at all, as it would be inefficient. Also, in joins we do not create a dict if we do not have to, nor do we hash the columns on which the data is merged (though sometimes it is inevitable).

In general, you are right that if we had a few large groups in both data frames, it would be more efficient to use the information from groupby to perform the join; I just gave an example where this is not beneficial. As I commented in my first post, join support for GroupedDataFrame could be added; it is just not done yet and it is not trivial (so other functionality took priority).