Subtracting mean from DataFrame column: Why so many allocations?

Consider the following DataFrame:

using DataFrames, Statistics  # Statistics provides mean

df = DataFrame()
df.a = repeat(1:1000, outer = 1000)
df.b = rand(1000 * 1000)

This DataFrame consists of 1000 groups (grouped by a) of 1000 values each (stored in b). For each group, I would like to compute the mean of its values and subtract that mean from every value in the group.

Here are two functions that achieve this goal:

function subtract_average_slow!(df)
    gdf = groupby(df, :a)
    for g in gdf
        mean_value = mean(g[!, :b])
        g[!, :b] .-= mean_value
    end
end

function subtract_average_fast!(df)
    gdf = groupby(df, :a)
    for g in gdf
        values = g[!, :b]
        mean_value = mean(values)
        values .-= mean_value
    end
end

Surprisingly (to me), the second function is significantly faster than the first:

julia> @time subtract_average_slow!(df)
  7.679187 seconds (26.10 k allocations: 7.481 GiB, 21.94% gc time)

julia> @time subtract_average_fast!(df)
  0.180345 seconds (12.10 k allocations: 23.290 MiB, 73.18% gc time)

I would like to gain some intuition for where all the memory allocations in the first function come from, so that I can avoid such bottlenecks in the future.


I would assume that the first version does a lookup operation in g for each iteration of the broadcasted assignment (and that’s type unstable). In the other version, you extract the column from the dataframe once and in the broadcast its type is known.

julia> Meta.@lower g[!, :b] .-= mean_value
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = Base.dotview(g, !, :b)
│   %2 = -
│   %3 = Base.getindex(g, !, :b)
│   %4 = Base.broadcasted(%2, %3, mean_value)
│   %5 = Base.materialize!(%1, %4)
└──      return %5
))))

dotview is defined in Base (the lowered code above calls it as Base.dotview).

DataFrames seems to define its own dotview method for DataFrame, so the call does not fall back to the generic behavior it would usually have. That's why you get different behavior depending on whether or not you use the intermediate variable:

julia> using DataFrames

julia> df = DataFrame(x = [1, 2, 3])
3×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> dv = Base.dotview(df, !, :x)
DataFrames.LazyNewColDataFrame{Symbol, DataFrame}(3×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3, :x)
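
For contrast, here is a small sketch of what dotview does for a plain Vector, using only Base. The variable names are made up for illustration. With an ordinary array the dotted update goes through a view into the existing storage, so nothing is replaced:

```julia
v = [1.0, 2.0, 3.0]

# For a plain array, dotview hands back a view into the existing storage
dv = Base.dotview(v, 1:2)
is_view = dv isa SubArray

# so a dotted update through the same index writes into `v` itself
v[1:2] .-= 1.0
v
```

The LazyNewColDataFrame shown above is a different kind of object, which is consistent with `df[!, :x] .-= ...` not behaving like an ordinary in-place update.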

Thank you for the insightful answer.

Do you have an idea as to why DataFrames defines special behaviour for dotview? I’m hoping to get some intuition for when and why this ‘issue’ occurs.

I don’t know, maybe a question for @bkamins

This operation can change the column's element type, so it has to allocate a new column instead of reusing the existing one. Change it to:

g[:, :b] .-= mean_value

(which updates the existing column in place instead of replacing it) and things will be fast.
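
Put together, the suggested change could look like the sketch below. The function name `subtract_average_inplace!` and the small test frame are made up for illustration; the only change relative to the slow version is `:` instead of `!` on the left-hand side.

```julia
using DataFrames, Statistics

# Hypothetical variant of subtract_average_slow!: `:` makes the broadcast
# write into the existing column instead of replacing it.
function subtract_average_inplace!(df)
    for g in groupby(df, :a)
        g[:, :b] .-= mean(g[!, :b])
    end
    return df
end

# Small frame with 3 groups of 4 values each
df = DataFrame(a = repeat(1:3, outer = 4), b = collect(1.0:12.0))
subtract_average_inplace!(df)

# After de-meaning, every group mean should be numerically zero
group_means = combine(groupby(df, :a), :b => mean => :b_mean)
```

On the original 1000 × 1000 frame, `@time` on this variant should show whether the allocations drop as described.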

The g[!, :b] .-= mean_value version is needed if you want to de-mean a column that holds Int values, because the result generally cannot be stored in an Int column.
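
A minimal sketch of that Int case, assuming the `!` and `:` semantics described in this post. The column name `x` and the helper variable `inplace_failed` are made up for illustration:

```julia
using DataFrames, Statistics

df = DataFrame(x = [1, 2, 4])   # Int64 column whose mean (7/3) is not an integer
m = mean(df.x)

# `:` writes into the existing Int64 vector, so a fractional result cannot be stored
inplace_failed = try
    df[:, :x] .-= m
    false
catch err
    err isa InexactError
end

# `!` replaces the column, so its element type is free to change
df[!, :x] .-= m
new_eltype = eltype(df.x)
```

If the semantics are as described, the `:` attempt should fail with an InexactError, and after the `!` version the column should hold Float64 values.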

The distinction between ! and : is a special feature of DataFrames.jl that separates these two similar but different operations. Since ! as a row index only has meaning in DataFrames.jl, the package has to define its own dotview methods for it. The standard dotview does not support ! at all.
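
As a rough sanity check on the numbers from the first post, the arithmetic below assumes that every `g[!, :b] .-= mean_value` allocates one fresh Float64 vector as long as the whole parent column. That assumption is a reading of the explanation above, not something measured, and the variable names are made up for illustration.

```julia
nrows   = 1000 * 1000   # rows in the parent DataFrame
ngroups = 1000          # loop iterations, one per group

# Assumed cost per iteration: one new Float64 column of full length
bytes_per_iteration = nrows * sizeof(Float64)

# Total over all groups, in GiB
total_gib = ngroups * bytes_per_iteration / 2^30
```

This comes out at about 7.45 GiB, close to the 7.481 GiB that `@time` reported for the slow version.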


Not to plug the package I maintain too much, but DataFramesMeta.jl provides a convenient syntax to do this kind of operation which should reduce allocations.

It’s still a little slower than the fast! version, and I’m not really sure why. But it’s definitely faster than the slow version.

julia> function subtract_average_dfmeta!(df)
           @chain df begin
               @groupby :a
               @transform! :b = :b .- mean(:b)
           end
       end;

julia> @btime subtract_average_dfmeta!($df);
  23.111 ms (8001 allocations: 59.91 MiB)

julia> @btime subtract_average_fast!($df);
  12.979 ms (12085 allocations: 23.29 MiB)
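
For comparison, roughly the same operation can be written with plain DataFrames.jl by calling `transform!` on the grouped data frame. This is only a sketch: the function name `subtract_average_transform!` is made up, and it has not been benchmarked against the versions above.

```julia
using DataFrames, Statistics

# Hypothetical plain-DataFrames.jl variant: de-mean :b within each group of :a
function subtract_average_transform!(df)
    transform!(groupby(df, :a), :b => (b -> b .- mean(b)) => :b)
    return df
end

# Small frame with 3 groups of 4 values each
df = DataFrame(a = repeat(1:3, outer = 4), b = collect(1.0:12.0))
subtract_average_transform!(df)

# After de-meaning, every group mean should be numerically zero
group_means = combine(groupby(df, :a), :b => mean => :b_mean)
```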