Combining lots of DataFrames, best approach?

I am wondering if there are better ways to append lots of DataFrames together in these two scenarios.

# dts is a vector of lots of dataframes (>1000)
# here is a MWE version of it
dts = [DataFrame(pid=fill(string(rand(UInt16)),N),id=collect(1:N),pass=rand(("yes","no"),N)) for N in rand(10:50,10)]

# SCENARIO 1
# in this first case, all dataframes have the same schema 
dfmap = reduce((a,b)->vcat(a,b), filter(!isnothing,dts))

# SCENARIO 2
# in this second case, I call a function makenew which returns a transformed dataframe
# output from makenew will not generally have the same schema, so I use the :union option to keep all cols
# if it helps, all columns in dfnew will be Union{Missing, String}
function makenew(df)
    seqdf = unstack(df,"pid","id","pass")
    dropmissing!(seqdf,"pid")
    return seqdf
end
dfnew = mapreduce(makenew,(a,b)->vcat(a,b; cols= :union), filter(!isnothing,dts))

It is panel data and each element in dts has many rows, so that both dfmap and dfnew will potentially be very large.

How is this different from reduce(vcat, dts)?

DataFrame(reduce(append!, Tables.rowtable.(dts)))

For example, with

dts = [DataFrame(pid=fill(string(rand(UInt16)),N),id=collect(1:N),
                pass=rand(("yes","no"),N)) for N in rand(10:50,1000)]
@btime dfmap = reduce((a,b)->vcat(a,b), filter(!isnothing,dts))
  92.453 ms (107418 allocations: 352.46 MiB)

@btime DataFrame(reduce(append!, Tables.rowtable.(dts)))
  1.539 ms (9533 allocations: 2.96 MiB)

As @nilshg commented, a recommended pattern is:

reduce(vcat, your_data_frames)

where your_data_frames should already be preprocessed (e.g. filtered or mapped). If you use this pattern, reduce will pre-allocate the appropriate data structures.

Note that you can pass keyword arguments to reduce if you need, e.g., to take a union of columns when the data frames have different column sets.
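Applied to Scenario 1 from the original post, the pattern is just (a sketch, reusing the dts defined there):

```julia
# All data frames share a schema, so plain vcat via reduce suffices;
# the specialized reduce method pre-allocates the result columns.
dfmap = reduce(vcat, filter(!isnothing, dts))
```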


Could you show how to rewrite this expression using the reduce kwargs, please?

reduce((a,b)->vcat(a,b; cols= :union), dts)

Is this mentioned somewhere in the DataFrames manual? I know about the optimization for reduce, but I didn’t realize there was a method for DataFrames that could handle keywords. (It’s in the docstring for reduce, I realize, but I don’t see it mentioned anywhere else.)

@rocco_sprmnt21: you would write this as

reduce(vcat, your_data_frames, cols=:union)

Also it appears that there is no method to handle these kwargs for mapreduce, just reduce: https://github.com/JuliaData/DataFrames.jl/issues/3028
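Until mapreduce gains such a method, one workaround (a sketch, reusing makenew and dts from the original post) is to materialize the map first and feed the result to the specialized reduce, which does accept the keyword:

```julia
# There is no specialized mapreduce method accepting cols, so split it:
transformed = map(makenew, filter(!isnothing, dts))
dfnew = reduce(vcat, transformed; cols=:union)  # keeps the union of columns
```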

Yes, just check the help on reduce:

  reduce(::typeof(vcat),
         dfs::Union{AbstractVector{<:AbstractDataFrame},
                    Tuple{AbstractDataFrame, Vararg{AbstractDataFrame}}};
         cols::Union{Symbol, AbstractVector{Symbol},
                     AbstractVector{<:AbstractString}}=:setequal,
         source::Union{Nothing, Symbol, AbstractString,
                       Pair{<:Union{Symbol, AbstractString}, <:AbstractVector}}=nothing)

  Efficiently reduce the given vector or tuple of AbstractDataFrames with vcat.

  The column order, names, and types of the resulting DataFrame, and the behavior of cols and source keyword arguments follow the rules specified for vcat of
  AbstractDataFrames.

Sorry, perhaps my comment wasn’t clear. I see that it’s in the docstring; I was wondering if there is a mention of it in the “manual” part of the documentation. Looks like there is no “concatenation” section though.

Ah - OK. Can you please open an issue, or even better make a PR (if you know what kind of content would be useful for you from a user’s perspective). Thank you!

@nilshg Yes, I guess I just copied and pasted the anonymous version from the second case and deleted the keyword. Plenty of performance was lost because of that in itself.

Given that reduce(vcat, Vector{DataFrame}; kwargs...) seems pretty performant, I’m going to try breaking the mapreduce into a ThreadsX.map and a reduce to see if that helps the second scenario.
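That split might look like this (a sketch, assuming ThreadsX.jl is installed and makenew is thread-safe; makenew and dts are from the original post):

```julia
using ThreadsX

# Run makenew on the >1000 data frames across threads...
transformed = ThreadsX.map(makenew, filter(!isnothing, dts))
# ...then concatenate once with the fast specialized reduce
dfnew = reduce(vcat, transformed; cols=:union)
```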
