Most efficient way to add new columns in each SubDataFrame of a GroupDataFrame

phantom · October 26, 2022, 6:49am

Sorry, kind of a beginner's question again. Suppose I have a large GroupedDataFrame and I would like to add columns to each SubDataFrame in it.

function build(GDF)
    for k in eachindex(GDF)
        GDF[k].newcol1 = function1(GDF[k].col1)
        GDF[k].newcol2 = function2(GDF[k].col2)
    end
end

Is there a better way to go about this? The performance I am seeing is quite slow, and I am not sure whether this loop is the bottleneck or whether the issue is with the functions themselves.

According to the documentation, pre-allocating the output should sometimes help, but when I tried this out with a DataFrame, performance actually got worse.

df3 = DataFrame(X = [1, 2, 3, 4], Y = [0, 1, 2, 4])

julia> @benchmark df3.A = df3.X + df3.Y
BenchmarkTools.Trial: 10000 samples with 960 evaluations.
 Range (min … max):  84.158 ns …  1.026 μs  ┊ GC (min … max): 0.00% … 89.01%
 Time  (median):     89.800 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   94.999 ns ± 39.546 ns  ┊ GC (mean ± σ):  1.77% ±  4.00%

  █▇▅▂▄▆▇▆▅▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▂▁▁  ▁▁ ▁                       ▂
  █████████████████████████████████████████████▇▇▆▇▇▇▇▆▇▇▆▆▆▇ █
  84.2 ns      Histogram: log(frequency) by time       137 ns <

 Memory estimate: 96 bytes, allocs estimate: 1.

Whereas preallocating a vector beforehand yielded

julia> function f1(df)
       df.G = Vector{Int}(undef,4)
       df.G = df.X+df.Y
       end

f1 (generic function with 1 method)

julia> @benchmark f1(df3)
BenchmarkTools.Trial: 10000 samples with 919 evaluations.
 Range (min … max):  109.675 ns …   2.028 μs  ┊ GC (min … max): 0.00% … 92.06%
 Time  (median):     111.353 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   124.812 ns ± 108.725 ns  ┊ GC (mean ± σ):  5.75% ±  6.19%

  █▅▃▁▃▄▄▄▃▃▂▂▁▁▁▁▁▁▁▁▁  ▁                                      ▁
  █████████████████████████████▇▇▇▇▇▇█▇▇▆▆▆▆▆▆▄▆▆▆▅▄▅▃▄▅▂▄▄▅▅▃▅ █
  110 ns        Histogram: log(frequency) by time        178 ns <

 Memory estimate: 224 bytes, allocs estimate: 3.

and using transform seems to be even slower

julia> @benchmark @transform(df3, :Q = :X .+ :Y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.625 μs …  4.450 ms  ┊ GC (min … max): 0.00% … 99.20%
 Time  (median):     12.875 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.912 μs ± 61.162 μs  ┊ GC (mean ± σ):  5.75% ±  1.40%

  ▄▆▇█▇▇▆▅▄▄▃▃▂▂▂▃▂▂▂▂▁▁▁  ▁▁▂▁▁▁     ▁▁                      ▂
  █████████████████████████████████████████▇▇█▇▇▇▇▇▆▆▆▆▆▆▅▆▆▆ █
  11.6 μs      Histogram: log(frequency) by time      26.1 μs <

 Memory estimate: 9.20 KiB, allocs estimate: 168.

Any tips on what I’m doing wrong would be greatly appreciated thanks!

nilshg · October 26, 2022, 7:40am

I’m a little bit confused - your question is about adding columns to a SubDataFrame, but your benchmarking code is all about adding columns to a regular DataFrame. Are you concerned about a slowdown when adding columns to a SubDataFrame compared to working with a DataFrame or about the speed of adding a column to a DataFrame in general?

Also your first attempt runs in 84 nano(!)seconds and has one allocation, do you think that this is somehow “too slow”? If so, what would you expect? Preallocation is a useful strategy when you can re-use memory in a calculation, but in this case you want to create a new vector, so you will have to make at least one allocation.
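To illustrate the point about re-using memory, here is a minimal sketch built on the `df3` columns from the question. It assumes DataFrames.jl is installed; the name `buf` is only there for illustration. The first assignment has to allocate because column `A` does not exist yet. The second one writes into the existing column through `df[:, :A] .= ...`, which is the in-place broadcast form.

```julia
using DataFrames

df = DataFrame(X = [1, 2, 3, 4], Y = [0, 1, 2, 4])

# First creation: `df.X + df.Y` builds a new vector that becomes column A.
# At least one allocation is unavoidable here.
df.A = df.X + df.Y

# Keep a reference to the vector currently stored as column A (no copy is made).
buf = df.A

# Later update: the fused broadcast writes into the existing column's memory,
# so no new result vector is created.
df.X[1] = 10
df[:, :A] .= df.X .+ df.Y

# The column is still the very same vector, now holding the updated values.
same_memory = df.A === buf
```

This only pays off when the column is recomputed repeatedly. Creating a vector with `undef` and then immediately assigning a freshly computed vector to the same column, as in `f1`, just adds an extra allocation.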

bkamins · October 26, 2022, 10:07am

A standard way to write it would be:

transform!(gdf, :col1 => function1 => :newcol1, :col2 => function2 => :newcol2)

and this should be efficient.

Of course this assumes that your data frame is large enough. For very small data frames (as in your example) compilation and bookkeeping will be much more expensive than computations themselves.
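For reference, a small self-contained sketch of that call. The data, the grouping column `id`, and the bodies of `function1` and `function2` are made up for illustration, since the question does not show them. Each function receives one group's column as a vector and returns a vector of the same length.

```julia
using DataFrames

df = DataFrame(id = [1, 1, 2, 2],
               col1 = [1.0, 2.0, 3.0, 4.0],
               col2 = [10, 20, 30, 40])
gdf = groupby(df, :id)

# Hypothetical stand-ins for the functions in the question.
function1(v) = v .- sum(v) / length(v)   # subtract the group mean
function2(v) = cumsum(v)                 # running total within the group

# Applies both functions group by group and stores the results
# as new columns in the parent data frame `df`.
transform!(gdf, :col1 => function1 => :newcol1,
                :col2 => function2 => :newcol2)
```

Because `transform!` works in place on the parent of `gdf`, the new columns show up in `df` itself and there is no need to loop over the groups by hand.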

phantom · October 26, 2022, 11:03am

Thanks so much for pointing this out! I was concerned about the speed of adding columns to a SubDataFrame, but I used a DataFrame as a benchmark because the GroupedDataFrame I am working with is quite large, and I assumed that the most efficient method for adding columns to a DataFrame would apply equally to a SubDataFrame. Is this an incorrect assumption? Also, thanks for explaining the performance with respect to pre-allocation. Going over the documentation makes a lot more sense now. So is it safe to say that it is not the pre-allocation itself that improves performance, but rather avoiding repeated allocation of memory?

phantom · October 26, 2022, 11:20am

Thanks so much, Bogumił! Sorry, I’m still having a little difficulty understanding the pair notation, even though I’ve seen it used quite often. Is it the case that, for some column x in a DataFrame df, the first pairing operator => runs the function on the column, such that

:x => function === function(df.x)

whereas the second instance of the pairing operator acts as an assignment to :newcol? What happens if the function involves more than one column? Also, thanks for clarifying the discrepancy in the performance of transform!

bkamins · October 26, 2022, 1:47pm

The operator works as follows:

[:source_column] => [function] => [:target_column]

which is (simplifying a bit, but I understand you want a mental model) the same as:

df.target_column = function(df.source_column)

So as you can see, the => operator shows the ETL data flow (if you happen to have worked with databases):

Extract a column
Transform data
Load the result into a data frame
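As a quick check of this mental model, the sketch below computes the same column twice, once by plain assignment and once with the pair notation. The function `double` and the column names are hypothetical and chosen only for illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3, 4])

# Hypothetical transformation that takes a whole column and returns a vector.
double(v) = 2 .* v

# Plain assignment: extract df.a, transform it, load the result into df.b.
df.b = double(df.a)

# The same three steps written with the pair notation.
transform!(df, :a => double => :c)

# Both routes produce identical columns.
same_result = df.b == df.c
```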

Function can involve both many input columns and many output columns. Let me give you a concrete example:

julia> df = DataFrame(a=[1, 2, 3, 4], b=[4, 3, 2, 1])
4×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      3
   3 │     3      2
   4 │     4      1

julia> f(a, b) = extrema.(zip(a, b))
f (generic function with 1 method)

julia> f(df.a, df.b)
4-element Vector{Tuple{Int64, Int64}}:
 (1, 4)
 (2, 3)
 (2, 3)
 (1, 4)

julia> transform!(df, [:a, :b] => f => [:min, :max])
4×4 DataFrame
 Row │ a      b      min    max
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      4      1      4
   2 │     2      3      2      3
   3 │     3      2      2      3
   4 │     4      1      1      4

As you can see, the “extract” part can fetch multiple columns and the “load” part can store back multiple columns. The details here are more complex (and are related to the AsTable wrapper). I recommend that you read either the documentation or my blog post “DataFrames.jl minilanguage explained” for an explanation.
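A brief sketch of the `AsTable` wrapper mentioned here, on the same data as above. The function `g` and the column name `total` are made up for illustration. `AsTable` as a destination spreads the fields of returned named tuples into columns. As a source, it passes the selected columns to the function as a named tuple, which `ByRow` then handles one row at a time.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3, 4], b = [4, 3, 2, 1])

# Hypothetical row-wise function returning a named tuple;
# its field names become the new column names.
g(a, b) = (min = min(a, b), max = max(a, b))

# AsTable as the target: one new column per field of the named tuples.
transform!(df, [:a, :b] => ByRow(g) => AsTable)

# AsTable as the source: each row arrives as a named tuple (a = ..., b = ...),
# and `sum` adds up its values.
transform!(df, AsTable([:a, :b]) => ByRow(sum) => :total)
```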

phantom · October 27, 2022, 2:08am

Thanks, the explanation and blog post are really helpful! Looking forward to the hard copy of your book in December!
