The transform! function appears to result in a large number of allocations and a large slowdown compared to the non-grouped counterpart.
using DataFrames
using StatsBase: sample
using BenchmarkTools

# 1M rows, 100 numeric columns, plus a grouping column with 100 levels
df1 = DataFrame(rand(1_000_000, 100), :auto)
df1[:, :grp] .= sample(1:100, 1_000_000)
dfg1 = groupby(df1, ["grp"])

# square column x9 and store the result as x9B
function test1(df)
    transform!(df, "x9" => ((x) -> x .^ 2) => "x9B")
end
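As an aside, the same elementwise operation can equivalently be written row-wise with ByRow; this produces the same result column and is just an alternative formulation (the helper name test1_byrow is only for illustration; the broadcast form above is the one benchmarked below):

test1_byrow(df) = transform!(df, "x9" => ByRow(x -> x^2) => "x9B")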
For a regular DataFrame:
julia> @btime test1($df1);
1.469 ms (556 allocations: 7.66 MiB)
For a GroupedDataFrame:
julia> @btime test1($dfg1);
388.991 ms (9003310 allocations: 311.65 MiB)
As can be seen, performance on the GroupedDataFrame is actually quite bad (roughly 250x slower, with 40x the allocated memory), whereas my expectation was a relatively modest overhead from operating on the 100 groups. Did I use transform! incorrectly, or is there a real performance issue?
The above was run on DataFrames v1.1.0.
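For context, my mental model of transform! on a GroupedDataFrame is roughly one pass over each group, something like this simplified sketch (grouped_square! is a hypothetical helper, not the actual DataFrames.jl implementation), which is why I expected only modest overhead:

function grouped_square!(gdf)
    out = Vector{Float64}(undef, nrow(parent(gdf)))   # result column, in parent row order
    for sdf in gdf                                    # each sdf is one group (a SubDataFrame)
        rows = parentindices(sdf)[1]                  # this group's row indices in the parent
        out[rows] = sdf.x9 .^ 2
    end
    parent(gdf)[!, :x9B] = out
    return parent(gdf)
end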
Thank you very much for reporting. Mostly fixed in fix performance issue in multirow split-apply-combine by bkamins · Pull Request #2749 · JuliaData/DataFrames.jl · GitHub (maybe it can still be improved, but the major sources of the problem are solved):
julia> @btime test1($df1);
890.798 μs (556 allocations: 7.66 MiB)
julia> @btime test1($dfg1);
25.061 ms (65066 allocations: 54.79 MiB)
I am down to:
julia> @btime test1($df1);
901.361 μs (554 allocations: 7.66 MiB)
julia> @btime test1($dfg1);
22.008 ms (4418 allocations: 44.94 MiB)
Thanks a lot for the quick fix!
Regarding the remaining ~25x difference in execution time, does such a gap fall within expectations? My intuition was that having the DataFrame sorted by the grouping column would reduce the difference vs. the non-grouped DataFrame to a fairly small amount, since each element would be accessed once in a straight sequence.
However, performance doesn’t seem to change much compared to when the groups are randomly scattered:
julia> @btime test1($dfg1);
352.164 ms (9003307 allocations: 302.20 MiB)
dfs = sort(df1, [:grp])
dfgs = groupby(dfs, ["grp"])
julia> @btime test1($dfgs);
296.777 ms (9003406 allocations: 302.20 MiB)
Regarding the remaining ~25x difference in execution time, does such a gap fall within expectations?
I would prefer it to be smaller, but I do not see a quick fix for this now (I will have to think).
performance doesn’t seem to change much compared to when the groups are randomly scattered:
This is a good point. In joins we already take advantage of the data being sorted; in grouping we do not handle this yet, but it is on the to-do list.
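To illustrate the idea: when the table is sorted on the grouping column, each group occupies a contiguous block of rows, so group detection can in principle be a single linear scan rather than a hash-based pass. A minimal sketch of such a scan (contiguous_group_ranges is a hypothetical helper, not part of DataFrames.jl):

function contiguous_group_ranges(v::AbstractVector)
    ranges = UnitRange{Int}[]          # one range per contiguous run of equal values
    start = 1
    for i in 2:length(v)
        if v[i] != v[i-1]              # group boundary found
            push!(ranges, start:i-1)
            start = i
        end
    end
    push!(ranges, start:length(v))     # close the final run
    return ranges
end

contiguous_group_ranges(dfs.grp)       # 100 ranges for the sorted example above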