I am wondering if there are better ways to append lots of DataFrames together in these two scenarios.
# dts is a vector of lots of dataframes (>1000)
# here is a MWE version of it
dts = [DataFrame(pid=fill(string(rand(UInt16)),N),id=collect(1:N),pass=rand(("yes","no"),N)) for N in rand(10:50,10)]
# SCENARIO 1
# in this first case, all dataframes have the same schema
dfmap = reduce((a,b)->vcat(a,b), filter(!isnothing,dts))
# SCENARIO 2
# in this second case, I call a function makenew which returns a transformed dataframe
# output from makenew will not generally have the same schema, so I use the :union option to keep all cols
# if it helps, all columns in dfnew will be Union{Missing, String}
function makenew(df)
seqdf = unstack(df,"pid","id","pass")
dropmissing!(seqdf,"pid")
return seqdf
end
dfnew = mapreduce(makenew,(a,b)->vcat(a,b; cols= :union), filter(!isnothing,dts))
It is panel data and each element in dts has many rows, so that both dfmap and dfnew will potentially be very large.
where your_data_frames should be already preprocessed (e.g. filtered or mapped). If you use this pattern reduce will pre-allocate appropriate data structures.
Note that you can pass to reduce the kwargs if you need e.g. to make a union of columns if data frames have different column sets.
Is this mentioned somewhere in the DataFrame manual? I know about the optimization for reduce but I didn’t realize there was a method for DataFrames that could handle keywords. (It’s in the docstring for reduce, I realize, but I don’t see it mentioned anywhere else).
reduce(::typeof(vcat),
dfs::Union{AbstractVector{<:AbstractDataFrame},
Tuple{AbstractDataFrame, Vararg{AbstractDataFrame}}};
cols::Union{Symbol, AbstractVector{Symbol},
AbstractVector{<:AbstractString}}=:setequal,
source::Union{Nothing, Symbol, AbstractString,
Pair{<:Union{Symbol, AbstractString}, <:AbstractVector}}=nothing)
Efficiently reduce the given vector or tuple of AbstractDataFrames with vcat.
The column order, names, and types of the resulting DataFrame, and the behavior of cols and source keyword arguments follow the rules specified for vcat of
AbstractDataFrames.
Sorry, perhaps my comment wasn’t clear. I see that it’s in the docstring; I was wondering if there is a mention of it in the “manual” part of the documentation. Looks like there is no “concatenation” section though.
Ah - OK. Can you please open an issue, or even better make a PR (if you know what kind of content would be useful for you from a user’s perspective). Thank you!
@nilshg Yes I guess I just copied and pasted the anonymous version from the second case and deleted the keyword. Plenty of performance lost because of that in itself.
Given that reduce(vcat,Vector{DataFrame};kwargs) seems pretty performant, I’m going to try breaking the mapreduce into a ThreadsX.map and a reduce to see if that helps the second scenario.