This DataFrame consists of 1000 groups (grouped by a) of 1000 values each (stored in b). For each group, I would like to take the mean of the values, and subtract it.
Here are two functions that achieve this goal:
function subtract_average_slow!(df)
    gdf = groupby(df, :a)
    for g in gdf
        mean_value = mean(g[!, :b])
        g[!, :b] .-= mean_value
    end
end
function subtract_average_fast!(df)
    gdf = groupby(df, :a)
    for g in gdf
        values = g[!, :b]
        mean_value = mean(values)
        values .-= mean_value
    end
end
Surprisingly (to me), the second function is significantly faster than the first:
I would like to gain some intuition for where all the memory allocations in the first function come from, so that I can avoid such bottlenecks in the future.
I would assume that the first version does a lookup operation in g for each iteration of the broadcasted assignment (and that's type unstable). In the other version, you extract the column from the DataFrame once, so its type is known inside the broadcast.
DataFrames also seems to define special behavior for dotview on a DataFrame, so it does not fall back to getindex as it usually would. That's why you get different behavior when you use the intermediate variable versus when you don't:
Do you have an idea as to why DataFrames defines special behaviour for dotview? I'm hoping to get some intuition for when and why this "issue" occurs.
g[!, :b] .-= mean_value can change the column type, so it has to allocate a new column rather than update the existing one. Change it to:
g[:, :b] .-= mean_value
(which updates the column in place instead of replacing it) and things will be fast.
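Applied to the loop from the question, that change might look like the sketch below. The name subtract_average_inplace! and the tiny df_small are illustrative assumptions, not part of the original post, and no timings are claimed for it.

```julia
using DataFrames, Statistics

# Hypothetical variant of the loop from the question: `:` instead of `!`
# on the left-hand side, so the existing column is written to in place.
function subtract_average_inplace!(df)
    for g in groupby(df, :a)
        mean_value = mean(g[!, :b])
        g[:, :b] .-= mean_value   # updates the rows of this group in the parent column
    end
    return df
end

# Tiny illustrative data (not the 1000 x 1000 DataFrame from the question).
df_small = DataFrame(a = [1, 1, 2, 2], b = [1.0, 3.0, 10.0, 20.0])
subtract_average_inplace!(df_small)
```

Each group in df_small should end up centered on zero, which is a quick way to check the loop before benchmarking it on the full data.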
The g[!, :b] .-= mean_value version is needed if you want to de-mean a column that holds Int, because the de-meaned values generally cannot be stored back into an Int column.
The distinction between ! and : is a special feature of DataFrames.jl that lets you distinguish these two different (but similar) operations. Since ! is specific to DataFrames.jl, dotview must be overridden for it; the standard dotview does not support ! at all.
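A minimal sketch of that difference on a plain DataFrame (the variable names here are made up for illustration): broadcasting into `:` reuses the stored vector, while broadcasting into `!` swaps in a freshly allocated column whose element type can differ.

```julia
using DataFrames

df_demo = DataFrame(x = [1, 2, 3])   # Int column
original = df_demo.x                  # the stored vector itself, not a copy

# `:` writes into the existing vector.
df_demo[:, :x] .-= 1
same_after_colon = df_demo.x === original

# In place, a Float64 result cannot be stored in an Int vector.
inplace_failed = try
    df_demo[:, :x] .-= 0.5
    false
catch err
    err isa InexactError
end

# `!` replaces the column, so the element type is free to change.
df_demo[!, :x] .-= 0.5
same_after_bang = df_demo.x === original
new_eltype = eltype(df_demo.x)
```

Here same_after_colon is true and same_after_bang is false, which is the allocation the slow version pays once per group.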
Not to plug the package I maintain too much, but DataFramesMeta.jl provides a convenient syntax for this kind of operation, which should reduce allocations.
It's still a little bit slower than the fast! version, and I'm not really sure why. But it's definitely faster than the slow version.
julia> function subtract_average_dfmeta!(df)
           @chain df begin
               @groupby :a
               @transform! :b = :b .- mean(:b)
           end
       end;
julia> @btime subtract_average_dfmeta!($df);
23.111 ms (8001 allocations: 59.91 MiB)
julia> @btime subtract_average_fast!($df);
12.979 ms (12085 allocations: 23.29 MiB)
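For comparison, roughly the same operation can be written with plain DataFrames.jl by calling transform! on the grouped data frame. This is only a sketch: the function name subtract_average_transform! and the small example data are assumptions for illustration, and it was not part of the benchmarks above.

```julia
using DataFrames, Statistics

# Hypothetical plain-DataFrames.jl counterpart of the macro version:
# per group, replace :b with :b minus the group mean, updating df itself.
function subtract_average_transform!(df)
    transform!(groupby(df, :a), :b => (x -> x .- mean(x)) => :b)
    return df
end

# Tiny illustrative data with interleaved groups.
df_interleaved = DataFrame(a = [1, 2, 1, 2], b = [1.0, 10.0, 3.0, 20.0])
subtract_average_transform!(df_interleaved)
```

transform! keeps the original row order, so the de-meaned values line up with the rows they came from even when the groups are interleaved.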