Understanding DataFrame allocations

Hi there,

The split-apply-combine strategy is something that I use a lot, so I'm trying to understand a bit more about what goes on under the hood so that I can write better code. I was just looking at the following example:

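using DataFrames, BenchmarkTools

df = DataFrame(x = rand(20), y = rand(20))

# Doubles column x through the transform! minilanguage
function test(df)
    df |>
        x -> transform!(x, :x => ByRow(val -> 2*val) => identity)
end

# Doubles column x by direct assignment
function test2(df)
    df.x = 2 .* df.x
end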

@btime on test gives
[screenshot of @btime output, 2024-11-18 14:01:09; not reproduced]

While @btime on test2 gives
[screenshot of @btime output, 2024-11-18 14:01:36; not reproduced]

My first question is: where are the 3 allocations coming from in the first case? And the more important second question: why is there such a huge difference between the piping + transform! implementation and the direct assignment? I thought transform! was an in-place method.

Thanks!

The infrastructure for dealing with the src => fun => dest inputs can be somewhat complicated, and this leads to allocations.
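To make the src => fun => dest shape concrete, here is a minimal sketch that takes the specification from the question apart. It only shows what transform! receives as input, not how DataFrames.jl processes it internally; the names spec, src, fun, dest and demo are made up for illustration.

```julia
using DataFrames

# `=>` is right-associative, so this is :x => (ByRow(...) => identity).
spec = :x => ByRow(val -> 2 * val) => identity

src  = spec.first           # the source column, :x
fun  = spec.second.first    # the ByRow wrapper around the anonymous function
dest = spec.second.second   # a function that maps the source name to the target name

# Small illustrative data frame (values chosen by hand, not from the question).
demo = DataFrame(x = [1.0, 2.0], y = [0.5, 0.5])
transform!(demo, spec)      # demo.x is now [2.0, 4.0]; demo.y is unchanged
```

The specification is an ordinary nested Pair, so transform! has to inspect it at run time before any row is touched, which is where the extra bookkeeping happens.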

Fortunately, this is a fixed cost. When a data frame gets bigger, the difference between the two functions disappears:

julia> df = DataFrame(x = rand(1_000_000), y = rand(1_000_000));

julia> @btime test($df);
  936.341 μs (92 allocations: 7.63 MiB)

julia> @btime test2($df);
  928.734 μs (4 allocations: 7.63 MiB)
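If you want to see the fixed cost without BenchmarkTools, a rough sketch along the following lines compares the bytes allocated by the two approaches at several sizes. It assumes the same two functions as in the question (test2 here additionally returns df), and allocated_bytes is a hypothetical helper, not part of any package. Exact numbers will vary with Julia and DataFrames.jl versions.

```julia
using DataFrames

test(df) = transform!(df, :x => ByRow(val -> 2 * val) => identity)

function test2(df)
    df.x = 2 .* df.x
    return df
end

# Hypothetical helper: bytes allocated by each approach for a data frame with n rows.
function allocated_bytes(n)
    df = DataFrame(x = rand(n), y = rand(n))
    via_transform = @allocated test(df)
    via_assignment = @allocated test2(df)
    return (; n, via_transform, via_assignment)
end

allocated_bytes(10)  # warm-up call so that compilation is not measured

for n in (20, 1_000, 1_000_000)
    println(allocated_bytes(n))
end
```

Both approaches allocate a new vector for the doubled column, so the gap between the two byte counts should stay roughly constant while the totals grow with n.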