I want to make a struct containing small dataframes that can be merged efficiently into larger dataframes. Are there any out-of-the-box solutions to cache the matching effort required for the joins?
I tried to do this with GroupedDataFrames, since they store the row numbers for each group of the grouping columns, which should give a speed boost. However, I cannot seem to do joins on these structs:
using DataFrames
# Making Example data
rows = 1000000
stock_day = DataFrame(stock = repeat([:A,:b], Int(rows/2)), day = sort(repeat(collect(1:Int(rows/2)),2)), price = rand(rows) )
day = DataFrame(day = collect(1:Int(rows/2)), temperature = rand(Int(rows/2)))
stock = DataFrame(stock = [:A,:b], country = [:AUS, :USA])
country = DataFrame(country = [:AUS, :USA], GDP_per_capita = [5,6])
struct SplitData2
    frames::Dict{Symbol,GroupedDataFrame}
end
sd2 = SplitData2(Dict{Symbol,GroupedDataFrame}(
    :stock_day => groupby(stock_day, [:stock, :day]),
    :day       => groupby(day, :day),
    :stock     => groupby(stock, :stock),
    :country   => groupby(country, :country),
))
function stock_day_gdp(sd::SplitData2)
    aa = leftjoin(sd.frames[:stock_day], sd.frames[:day], on = :day)
    bb = leftjoin(aa, sd.frames[:stock], on = [:stock])
    return bb
end
@time sg = stock_day_gdp(sd2)
Here I get ERROR: MethodError: no method matching leftjoin(::GroupedDataFrame{DataFrame}, ::GroupedDataFrame{DataFrame}; on=:day)
Is there a way to efficiently merge GroupedDataFrames? If not, is there a way to maintain a set of dataframes with preselected merge columns (with the merge matching precomputed) that can then be merged efficiently?
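
To make concrete what I mean by "precomputed matching", here is a rough sketch of the behaviour I am after, done by hand. `CachedJoin`, `cache_join` and `materialize` are hypothetical names of my own, not part of DataFrames.jl. The sketch assumes a single key column, unique keys in the right-hand table, and a match for every left-hand row. I have not benchmarked it, and the cached rows go stale if either key column changes.

```julia
using DataFrames

# Hypothetical container: two tables plus the cached row matching between them.
struct CachedJoin
    left::DataFrame
    right::DataFrame
    on::Symbol
    rows::Vector{Int}  # rows[i] = row of `right` whose key equals the key in row i of `left`
end

# Do the matching once. `indexin` (from Base) returns, for each left key,
# the first row of `right` holding that key, or `nothing` if there is none.
function cache_join(left::DataFrame, right::DataFrame, on::Symbol)
    idx = indexin(left[!, on], right[!, on])
    any(isnothing, idx) && error("this sketch assumes every left key has a match")
    return CachedJoin(left, right, on, Vector{Int}(idx))
end

# Build the joined table from the cached matching; no keys are compared here.
function materialize(cj::CachedJoin)
    extra = select(cj.right, Not(cj.on))      # right-hand columns except the key
    return hcat(cj.left, extra[cj.rows, :])   # pick the matching right-hand rows
end

# Small stand-ins for the tables above.
stock_day = DataFrame(stock = repeat([:A, :b], 3), day = repeat(1:3, inner = 2), price = rand(6))
day = DataFrame(day = 1:3, temperature = rand(3))

cj = cache_join(stock_day, day, :day)
joined = materialize(cj)
```

Calling `materialize(cj)` again reuses `cj.rows` instead of matching the keys again, which is the part I would like a library to handle for me.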