Understanding DataFrame allocations

miguelborrero · November 18, 2024, 10:04pm

Hi there,

The split-apply-combine strategy is something I use a lot, so I'm trying to understand a bit more of what goes on under the hood so that I can write better code. I was just looking at the following example:

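using DataFrames, BenchmarkTools

df = DataFrame(x = rand(20), y = rand(20))

# Doubles column x through the src => fun => dest mini-language
function test(df)
    df |>
        x -> transform!(x, :x => ByRow(val -> 2 * val) => identity)
end

# Doubles column x by assigning a freshly computed vector
function test2(df)
    df.x = 2 .* df.x
end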

@btime on test gives
[Screenshot: @btime output for test]

While @btime on test2 gives
[Screenshot: @btime output for test2]

My first question is: where do the 3 allocations in the first case come from? The second, more important question: why is there such a huge difference between the piping + transform! implementation and the direct assignment? I thought transform! was an in-place method.

Thanks!

pdeffebach · November 18, 2024, 10:16pm

The infrastructure for dealing with the src => fun => dest inputs can be somewhat complicated, and this leads to allocations.

Fortunately, this is a fixed cost. When a data frame gets bigger, the difference between the two functions disappears:

julia> df = DataFrame(x = rand(1_000_000), y = rand(1_000_000));

julia> @btime test($df);
  936.341 μs (92 allocations: 7.63 MiB)

julia> @btime test2($df);
  928.734 μs (4 allocations: 7.63 MiB)
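
Note that both versions report the same 7.63 MiB, which is the size of one fresh 1,000,000-element Float64 column: transform! mutates the data frame, but the new column itself is still a newly allocated vector, just as in test2. A rough sketch of how one might check this is below. The names test3 and bytes_allocated are hypothetical helpers made up for illustration (they are not part of DataFrames.jl), and test3 assumes that overwriting the existing vector with a broadcasted assignment is acceptable for your use case.

```julia
using DataFrames

# The two functions from the original post, written compactly.
test(df)  = transform!(df, :x => ByRow(val -> 2 * val) => identity)
test2(df) = (df.x = 2 .* df.x)

# Hypothetical third variant: overwrite the existing column vector
# element by element instead of building a new one.
test3(df) = (df.x .*= 2; df)

# Hypothetical helper: bytes allocated by one call of f on an n-row data frame.
function bytes_allocated(f, n)
    df = DataFrame(x = rand(n), y = rand(n))
    f(df)                     # warm-up call so compilation is not measured
    return @allocated f(df)
end

for n in (20, 1_000_000)
    println("n = ", n,
            "  transform!: ", bytes_allocated(test, n),
            "  assignment: ", bytes_allocated(test2, n),
            "  broadcast in place: ", bytes_allocated(test3, n))
end
```

If the fixed-cost explanation holds, the gap between the first two columns should stay roughly constant as n grows, while the third variant should avoid the per-column allocation altogether.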
