Combining lots of DataFrames, best approach?

tbeason · March 25, 2022, 3:14pm

I am wondering if there are better ways to append lots of DataFrames together in these two scenarios.

# dts is a vector of lots of dataframes (>1000)
# here is a MWE version of it
dts = [DataFrame(pid=fill(string(rand(UInt16)),N),id=collect(1:N),pass=rand(("yes","no"),N)) for N in rand(10:50,10)]

# SCENARIO 1
# in this first case, all dataframes have the same schema 
dfmap = reduce((a,b)->vcat(a,b), filter(!isnothing,dts))

# SCENARIO 2
# in this second case, I call a function makenew which returns a transformed dataframe
# output from makenew will not generally have the same schema, so I use the :union option to keep all cols
# if it helps, all columns in dfnew will be Union{Missing, String}
function makenew(df)
    seqdf = unstack(df,"pid","id","pass")
    dropmissing!(seqdf,"pid")
    return seqdf
end
dfnew = mapreduce(makenew,(a,b)->vcat(a,b; cols= :union), filter(!isnothing,dts))

It is panel data and each element in dts has many rows, so that both dfmap and dfnew will potentially be very large.

nilshg · March 25, 2022, 3:50pm

How is this different from reduce(vcat, dts)?

rocco_sprmnt21 · March 25, 2022, 5:05pm

DataFrame(reduce(append!, Tables.rowtable.(dts)))

Timings with 1000 data frames:

dts = [DataFrame(pid=fill(string(rand(UInt16)),N),id=collect(1:N),
                pass=rand(("yes","no"),N)) for N in rand(10:50,1000)]

@btime dfmap = reduce((a,b)->vcat(a,b), filter(!isnothing,dts))
  92.453 ms (107418 allocations: 352.46 MiB)

@btime DataFrame(reduce(append!, Tables.rowtable.(dts)))
  1.539 ms (9533 allocations: 2.96 MiB)

bkamins · March 25, 2022, 5:06pm

As @nilshg commented, the recommended pattern is:

reduce(vcat, your_data_frames)

where your_data_frames should already be preprocessed (e.g. filtered or mapped). With this pattern, reduce pre-allocates the appropriate data structures.

Note that you can pass keyword arguments to reduce, e.g. cols=:union to take the union of columns when the data frames have different column sets.
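To make the difference concrete, here is a minimal sketch comparing the two forms on a small same-schema collection. The data is made up for illustration (deterministic pid values instead of the random ones in the MWE), and it only checks that the results agree; it does not measure timings.

```julia
using DataFrames

# Small same-schema data frames, in the spirit of the MWE (illustrative data).
dts = [DataFrame(pid=fill(string(i), n), id=collect(1:n), pass=rand(["yes", "no"], n))
       for (i, n) in enumerate(rand(10:50, 10))]

# Pairwise reduction: wrapping vcat in an anonymous function means the generic
# reduce is used, so every step builds a new, larger intermediate data frame.
slow = reduce((a, b) -> vcat(a, b), dts)

# Passing vcat itself selects the specialized DataFrames.jl method,
# which concatenates the whole collection in one call.
fast = reduce(vcat, dts)

# Both forms should produce the same table.
same_contents = slow == fast
same_rows = nrow(fast) == sum(nrow, dts)
```

The only change needed in user code is to pass vcat directly rather than a closure around it.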

rocco_sprmnt21 · March 25, 2022, 5:13pm

Could you please show how to rewrite this expression using reduce kwargs?

reduce((a,b)->vcat(a,b; cols= :union), dts)

haberdashPI · March 25, 2022, 5:19pm

Is this mentioned somewhere in the DataFrames.jl manual? I know about the optimization for reduce, but I didn’t realize there was a method for DataFrames that could handle keywords. (It’s in the docstring for reduce, I realize, but I don’t see it mentioned anywhere else.)

@rocco_sprmnt21: you would write this as

reduce(vcat, your_data_frames, cols=:union)

haberdashPI · March 25, 2022, 5:23pm

Also it appears that there is no method to handle these kwargs for mapreduce, just reduce: https://github.com/JuliaData/DataFrames.jl/issues/3028
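One way around the missing mapreduce method is to split the call into a map followed by reduce with the keyword. The sketch below reuses makenew from the original post on a tiny made-up collection (three data frames with 3, 5, and 4 rows), so the shape of the result is easy to check by hand.

```julia
using DataFrames

# makenew as defined in the original post.
function makenew(df)
    seqdf = unstack(df, "pid", "id", "pass")
    dropmissing!(seqdf, "pid")
    return seqdf
end

# Illustrative input: one pid per data frame, different lengths.
dts = [DataFrame(pid=fill(string(i), n), id=collect(1:n), pass=rand(["yes", "no"], n))
       for (i, n) in enumerate([3, 5, 4])]

# Step 1: transform every data frame; the outputs have different column sets.
wide = map(makenew, dts)

# Step 2: concatenate in one call, taking the union of all columns.
# Columns absent from a given data frame are filled with missing.
dfnew = reduce(vcat, wide; cols=:union)
```

Here dfnew has one row per input data frame and one column per distinct id plus pid.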

bkamins · March 25, 2022, 5:24pm

Yes, just get the help on reduce:

  reduce(::typeof(vcat),
         dfs::Union{AbstractVector{<:AbstractDataFrame},
                    Tuple{AbstractDataFrame, Vararg{AbstractDataFrame}}};
         cols::Union{Symbol, AbstractVector{Symbol},
                     AbstractVector{<:AbstractString}}=:setequal,
         source::Union{Nothing, Symbol, AbstractString,
                       Pair{<:Union{Symbol, AbstractString}, <:AbstractVector}}=nothing)

  Efficiently reduce the given vector or tuple of AbstractDataFrames with vcat.

  The column order, names, and types of the resulting DataFrame, and the behavior of cols and source keyword arguments follow the rules specified for vcat of
  AbstractDataFrames.
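The source keyword in that signature is easy to overlook. A minimal sketch of how it can be used, with made-up data and a hypothetical column name origin:

```julia
using DataFrames

dfs = [DataFrame(a=1:2), DataFrame(a=3:5)]

# A Symbol (or string) names a new column holding the position of the
# data frame in dfs that each row came from.
tagged = reduce(vcat, dfs; source=:origin)

# A Pair additionally supplies custom identifiers, one per data frame.
labelled = reduce(vcat, dfs; source=:origin => ["first", "second"])
```

This can be handy for panel data when the identity of each piece would otherwise be lost after concatenation.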

haberdashPI · March 25, 2022, 5:26pm

Sorry, perhaps my comment wasn’t clear. I see that it’s in the docstring; I was wondering if there is a mention of it in the “manual” part of the documentation. Looks like there is no “concatenation” section though.

bkamins · March 25, 2022, 5:27pm

Ah - OK. Can you please open an issue, or even better make a PR (if you know what kind of content would be useful for you from a user’s perspective). Thank you!

tbeason · March 25, 2022, 6:05pm

@nilshg Yes I guess I just copied and pasted the anonymous version from the second case and deleted the keyword. Plenty of performance lost because of that in itself.

Given that reduce(vcat,Vector{DataFrame};kwargs) seems pretty performant, I’m going to try breaking the mapreduce into a ThreadsX.map and a reduce to see if that helps the second scenario.
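A rough sketch of that split, using Threads.@spawn from Base in place of ThreadsX.map so that it needs no extra package. It assumes makenew is safe to call concurrently on separate data frames and that Julia was started with more than one thread; whether it actually helps will depend on how expensive makenew is relative to the task overhead. The input data is made up for illustration.

```julia
using DataFrames

# makenew as defined in the original post.
function makenew(df)
    seqdf = unstack(df, "pid", "id", "pass")
    dropmissing!(seqdf, "pid")
    return seqdf
end

# Illustrative input built up front, before any tasks are spawned.
dts = [DataFrame(pid=fill(string(i), n), id=collect(1:n), pass=rand(["yes", "no"], n))
       for (i, n) in enumerate(rand(10:50, 100))]

# Map step: one task per data frame (swapped in for ThreadsX.map).
tasks = [Threads.@spawn makenew(df) for df in dts]
wide = DataFrame[fetch(t) for t in tasks]

# Reduce step: a single call to the specialized method with the keyword.
dfnew = reduce(vcat, wide; cols=:union)
```

With many small data frames, spawning one task per element may cost more than it saves, so batching several data frames per task could be worth trying.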
