Faster groupwise joins to complete implicitly missing rows

danielw2904 · April 16, 2021, 1:17pm

I have a grouped DataFrame of dates and values, but a date only appears in a group if a value was observed on it. I'd like to create the rows that are implicitly missing and fill them with `missing`, e.g. for interpolation later. I came up with the following code; is there a faster way to do this?

using DataFrames, Dates

df = DataFrame(
    g = ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
    date = [Date(2021,1,1), Date(2021,1,2), Date(2021,1,2), Date(2021,1,4), Date(2021,1,1), Date(2021,1,3), Date(2021,1,7)],
    v = rand(7)
)
# Every day between the first and last observed date
alldates = DataFrame(date = minimum(df.date):Day(1):maximum(df.date))
gdf = groupby(df, :g)
combdf = DataFrame()
for g in gdf
    # Unobserved dates get missing in both g and v
    gout = leftjoin(alldates, g, on = :date)
    # Restore the group label on the newly created rows
    gout.g .= g.g[1]
    disallowmissing!(gout, :g)
    append!(combdf, gout, cols = :union)
end

nilshg · April 16, 2021, 2:17pm

Not sure if faster, but I think this is clearer:

julia> combdf2 = rename!(DataFrame(Iterators.product(alldates.date, unique(df.g))), [:date, :g])
julia> combdf2 = leftjoin(combdf2, df, on = [:date, :g])

julia> isequal(combdf, combdf2)
true
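
For anyone who wants to paste the idea into a fresh session, here is a minimal self-contained sketch of the same "build every (date, group) pair, then left-join the observations" approach. It is a variation, not the code above: it swaps in `crossjoin` for `Iterators.product` (so no `rename!` is needed), the names `grid` and `full` are made up for illustration, and it sorts at the end on the assumption that you should not rely on the row order a join returns.

```julia
using DataFrames, Dates

df = DataFrame(
    g = ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
    date = Date.(2021, 1, [1, 2, 2, 4, 1, 3, 7]),
    v = rand(7),
)

# Every (date, group) pair over the observed date range.
# crossjoin is used here instead of Iterators.product; it keeps the column names.
grid = crossjoin(
    DataFrame(date = minimum(df.date):Day(1):maximum(df.date)),
    DataFrame(g = unique(df.g)),
)

# Attach the observed values; pairs that were never observed get v = missing.
full = leftjoin(grid, df, on = [:date, :g])

# Sort explicitly rather than depending on the order the join produced.
sort!(full, [:g, :date])
```

With 7 dates and 3 groups, `full` has 21 rows, 14 of which have a missing `v`.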

danielw2904 · April 16, 2021, 9:05pm

Thanks! It is not only clearer but also twice as fast in a quick benchmark!
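
A rough way to repeat that kind of comparison is sketched below. This is not the benchmark that was actually run: the function names `complete_loop` and `complete_product` are hypothetical wrappers around the two snippets posted above, the synthetic data sizes are arbitrary, and `@elapsed` from Base gives only a single coarse timing (a package such as BenchmarkTools would give more reliable numbers). Results will vary with data size and DataFrames.jl version.

```julia
using DataFrames, Dates, Random

# Hypothetical wrapper around the loop-based approach from the first post.
function complete_loop(df, alldates)
    combdf = DataFrame()
    for sub in groupby(df, :g)
        gout = leftjoin(alldates, sub, on = :date)
        gout.g .= sub.g[1]
        disallowmissing!(gout, :g)
        append!(combdf, gout, cols = :union)
    end
    return combdf
end

# Hypothetical wrapper around the product-then-join approach from the reply.
function complete_product(df, alldates)
    grid = rename!(DataFrame(Iterators.product(alldates.date, unique(df.g))), [:date, :g])
    return leftjoin(grid, df, on = [:date, :g])
end

# Synthetic data (sizes are arbitrary assumptions):
# 200 groups, each observed on roughly half of the days in 2021.
Random.seed!(1)
days = collect(Date(2021, 1, 1):Day(1):Date(2021, 12, 31))
gs = repeat(1:200, inner = length(days))
ds = repeat(days, outer = 200)
keep = rand(length(gs)) .< 0.5
big = DataFrame(g = gs[keep], date = ds[keep], v = rand(count(keep)))
alldates = DataFrame(date = minimum(big.date):Day(1):maximum(big.date))

# Run each function once first so compilation time is not measured.
res_loop = complete_loop(big, alldates)
res_prod = complete_product(big, alldates)

t_loop = @elapsed complete_loop(big, alldates)
t_prod = @elapsed complete_product(big, alldates)
println("loop: ", t_loop, " s, product + join: ", t_prod, " s")
```

Sorting both results by `[:g, :date]` before comparing them with `isequal` avoids depending on the row order that the joins happen to return.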
