Understanding the performance issue in combine() [DataFrames.jl]

I have a performance issue when passing a user-defined function to combine(). Assume

df = DataFrame(g = rand(1:100,1000), x = rand(1000))

and

temp_fun(x) = sum(x)

The performance of combine() differs considerably between the following two scenarios, and I don’t understand how to close the gap:

using BenchmarkTools
@btime combine(groupby(df, :g), :x=>sum);
29.429 μs (188 allocations: 41.00 KiB)
@btime combine(groupby(df, :g), :x=>temp_fun);
64.421 μs (1020 allocations: 78.31 KiB)

Any suggestions? Thanks.

DataFrames.jl has fast-path implementations for groupby with certain functions like sum, which is why passing sum directly is faster than passing a wrapper such as temp_fun. You can see them in the code here

Currently I don’t think there’s a public API way to opt in for your own functions. If the operation can be expressed as a reduction (e.g. Base.add_sum for sum), then you could replicate what DataFrames does, but that relies on an internal API.
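As a rough illustration of the point above, here is a minimal sketch assuming the fast path is selected by the identity of the function passed to combine() rather than by what the function computes. The names sum_alias and :total are made up for this example. A wrapper like temp_fun is a distinct function and so should go through the generic per-group path, while a plain alias of sum is the very same function object:

```julia
using DataFrames

df = DataFrame(g = rand(1:100, 1000), x = rand(1000))
gdf = groupby(df, :g)

temp_fun(x) = sum(x)   # a new, distinct function that merely calls sum
sum_alias = sum        # hypothetical name: just another binding to sum itself

# Same output column name in all three calls so the results are comparable
fast    = combine(gdf, :x => sum       => :total)
slow    = combine(gdf, :x => temp_fun  => :total)
aliased = combine(gdf, :x => sum_alias => :total)

# All three agree on the values (up to floating-point rounding);
# only the code path taken inside DataFrames is expected to differ.
@assert sum_alias === sum
@assert fast.g == slow.g == aliased.g
@assert fast.total ≈ slow.total
@assert fast.total ≈ aliased.total
```

So if temp_fun is only a thin wrapper, passing sum itself (or an alias of it) is the simplest way to stay on the fast path. Timing the three calls with @btime on your own machine is the way to confirm whether the difference matters for your data.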
