Subtracting mean from DataFrame column: Why so many allocations?

JWMeer · April 29, 2024, 9:49am

Consider the following DataFrame:

using DataFrames, Statistics

df = DataFrame()
df.a = repeat(1:1000, outer = 1000)
df.b = rand(1000 * 1000)

This DataFrame consists of 1000 groups (grouped by a) of 1000 values each (stored in b). For each group, I would like to take the mean of the values, and subtract it.

Here are two functions that achieve this goal:

function subtract_average_slow!(df)
    gdf = groupby(df, :a)
    for g in gdf
        mean_value = mean(g[!, :b])
        g[!, :b] .-= mean_value
    end
end

function subtract_average_fast!(df)
    gdf = groupby(df, :a)
    for g in gdf
        values = g[!, :b]
        mean_value = mean(values)
        values .-= mean_value
    end
end

Surprisingly (to me), the second function is significantly faster than the first:

julia> @time subtract_average_slow!(df)
  7.679187 seconds (26.10 k allocations: 7.481 GiB, 21.94% gc time)

julia> @time subtract_average_fast!(df)
  0.180345 seconds (12.10 k allocations: 23.290 MiB, 73.18% gc time)

I would like to gain some intuition for where all the memory allocations in the first function come from, so that I can avoid such bottlenecks in the future.

jules · April 29, 2024, 10:17am

I would assume that the first version does a lookup operation in g for each iteration of the broadcasted assignment (and that’s type unstable). In the other version, you extract the column from the dataframe once and in the broadcast its type is known.

julia> Meta.@lower g[!, :b] .-= mean_value
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = Base.dotview(g, !, :b)
│   %2 = -
│   %3 = Base.getindex(g, !, :b)
│   %4 = Base.broadcasted(%2, %3, mean_value)
│   %5 = Base.materialize!(%1, %4)
└──      return %5
))))
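To make the three lowered steps concrete, here is a minimal sketch that performs them by hand on a plain Vector instead of a DataFrame. The variable names (x, m, dest, rhs) are only for illustration. For an ordinary array indexed with a range, dotview returns a view, so the final materialize! call writes straight into x.

```julia
# Plain-Vector analogue of the lowering shown above (illustrative names).
x = [1.0, 2.0, 3.0, 4.0]
m = 2.5

# Step 1: the left-hand side becomes a writable destination.
# For an ordinary array indexed with a range, dotview returns a view (SubArray).
dest = Base.dotview(x, 1:2)

# Step 2: the right-hand side is read with getindex (which copies the slice)
# and wrapped in a lazy Broadcasted object; nothing is computed yet.
rhs = Base.broadcasted(-, Base.getindex(x, 1:2), m)

# Step 3: materialize! evaluates the broadcast and writes the result into dest.
# Because dest is a view of x, this mutates the first two elements of x.
Base.materialize!(dest, rhs)
```

What matters is the type of the destination returned in step 1: whatever dotview returns decides how materialize! stores the result.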

dotview is defined here: github.com/JuliaLang/julia/blob/6023ad6718514c15b3297197757ae3d93b85270b/base/broadcast.jl#L1222-L1228

# x[...] .= f.(y...) ---> broadcast!(f, dotview(x, ...), y...).
# The dotview function defaults to getindex, but we override it in
# a few cases to get the expected in-place behavior without affecting
# explicit calls to view.   (All of this can go away if slices
# are changed to generate views by default.)

Base.@propagate_inbounds dotview(args...) = Base.maybeview(args...)

DataFrames seems to define its own dotview method for a DataFrame indexed with !, so the call does not fall back to the default shown above. That's why the behavior differs depending on whether you use the intermediate variable:

julia> using DataFrames

julia> df = DataFrame(x = [1, 2, 3])
3×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> dv = Base.dotview(df, !, :x)
DataFrames.LazyNewColDataFrame{Symbol, DataFrame}(3×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3, :x)

JWMeer · April 29, 2024, 1:03pm

Thank you for the insightful answer.

Do you have an idea as to why DataFrames defines special behaviour for dotview? I’m hoping to get some intuition for when and why this ‘issue’ occurs.

jules · April 29, 2024, 1:08pm

I don’t know, maybe a question for @bkamins

bkamins · April 29, 2024, 2:29pm

This operation can potentially change the column type, so it has to allocate extra memory. Change it to:

g[:, :b] .-= mean_value

(which updates the column in place instead of replacing it) and things will be fast.
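As a sketch of that suggestion applied to the original loop: the function name subtract_average_inplace! and the small test frame df_small are made up for illustration, and only the `:` indexing on the left-hand side differs from the slow version.

```julia
using DataFrames, Statistics

# Same loop as the slow version, but with `:` on the left-hand side so that
# each group's slice of :b is overwritten in place (hypothetical function name).
function subtract_average_inplace!(df)
    for g in groupby(df, :a)
        mean_value = mean(g[!, :b])
        g[:, :b] .-= mean_value
    end
    return df
end

# Small illustrative frame: 4 groups of 5 values each.
df_small = DataFrame(a = repeat(1:4, outer = 5), b = rand(20))
col_before = df_small.b            # keep a reference to the original column vector
subtract_average_inplace!(df_small)

# The parent column is still the same vector object, only its contents changed.
same_vector = df_small.b === col_before

# Every group mean should now be zero up to floating-point error.
group_means = combine(groupby(df_small, :a), :b => mean => :m).m
```

As a rough sanity check on the timings in the question: 7.481 GiB is close to what 1000 freshly allocated Float64 columns of 1,000,000 elements would occupy (1000 × 8 MB ≈ 7.45 GiB), which would fit with one column replacement per group. That reading is inferred from the numbers only and has not been verified against the DataFrames.jl source.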

The g[!, :b] .-= mean_value version is needed if you want to de-mean a column that holds Int values, because the Float64 result cannot be written into the existing Int column.

The distinction between ! and : is a special feature of DataFrames.jl that separates these two similar but different operations. Because ! has this special meaning in DataFrames.jl, dotview must be overridden for it; the standard dotview does not support ! at all.
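The difference is easiest to see on a small Int column. The sketch below uses made-up frames (df_replace, df_inplace): broadcasting with ! replaces the column and lets its element type change, while broadcasting with : tries to store the result in the existing Int vector and fails.

```julia
using DataFrames

# `!` replaces the column: the result may have a different element type.
df_replace = DataFrame(x = [1, 2, 3])
df_replace[!, :x] .-= 1.5
replaced_eltype = eltype(df_replace.x)   # the column now holds Float64 values

# `:` writes into the existing Vector{Int}, so a non-integer result cannot be stored.
df_inplace = DataFrame(x = [1, 2, 3])
inplace_error = try
    df_inplace[:, :x] .-= 1.5
    nothing
catch err
    err                                  # captured so it can be inspected afterwards
end
```

Here inplace_error holds an InexactError, and df_inplace.x is left unchanged because the very first element already fails to convert.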

pdeffebach · April 29, 2024, 3:09pm

Not to plug the package I maintain too much, but DataFramesMeta.jl provides a convenient syntax to do this kind of operation which should reduce allocations.

It’s still a little slower than the fast! version, and I’m not really sure why. But it’s definitely faster than the slow version.

julia> using DataFramesMeta, Statistics, BenchmarkTools

julia> function subtract_average_dfmeta!(df)
           @chain df begin
               @groupby :a
               @transform! :b = :b .- mean(:b)
           end
       end;

julia> @btime subtract_average_dfmeta!($df);
  23.111 ms (8001 allocations: 59.91 MiB)

julia> @btime subtract_average_fast!($df);
  12.979 ms (12085 allocations: 23.29 MiB)
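For comparison, the same grouped transformation can be written with plain DataFrames.jl, without the macros. This is a sketch with a made-up function name (subtract_average_transform!) and a small illustrative frame; it has not been benchmarked against the versions above.

```julia
using DataFrames, Statistics

# Grouped de-meaning with transform! on a GroupedDataFrame
# (hypothetical function name, not from the thread).
function subtract_average_transform!(df)
    # For each group, replace :b with its values minus the group mean.
    transform!(groupby(df, :a), :b => (x -> x .- mean(x)) => :b)
    return df
end

# Small illustrative frame: 3 groups of 4 values each.
df_t = DataFrame(a = repeat(1:3, outer = 4), b = rand(12))
a_orig = copy(df_t.a)
b_orig = copy(df_t.b)

# Reference result computed row by row, for checking.
expected = [b_orig[i] - mean(b_orig[a_orig .== a_orig[i]]) for i in eachindex(b_orig)]

subtract_average_transform!(df_t)
```

The rows of df_t keep their original order, so the result can be compared elementwise with the reference vector expected.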
