Most efficient way to add new columns in each SubDataFrame of a GroupDataFrame

Sorry kind of a beginners question again. Suppose I have a large GroupDataFrame and I would like to add columns to each SubDataFrame in the Group.

function build(GDF)
    for k = eachindex(GDF)
    GDF[k].newcol1 = function1(GDF[k].col1)
    GDF[k].newcol2 = function2(GDF[k].col2)
end 

Is there a better way to go about this? The performance I am seeing is quite slow and I am not sure if this is the bottleneck or if it is an issue with the function themselves.

According to the documentation pre-allocating output should help sometimes but when I tried this out with a DataFrame performance actually slowed.

df3 = DataFrame(X = [1, 2, 3, 4], Y = [0, 1, 2, 4])
julia> @benchmark df3.A = df3.X + df3.Y
BenchmarkTools.Trial: 10000 samples with 960 evaluations.
 Range (min … max):  84.158 ns …  1.026 ΞΌs  β”Š GC (min … max): 0.00% … 89.01%
 Time  (median):     89.800 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   94.999 ns Β± 39.546 ns  β”Š GC (mean Β± Οƒ):  1.77% Β±  4.00%

  β–ˆβ–‡β–…β–‚β–„β–†β–‡β–†β–…β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–β–  ▁▁ ▁                       β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–‡β–‡β–‡β–‡β–†β–‡β–‡β–†β–†β–†β–‡ β–ˆ
  84.2 ns      Histogram: log(frequency) by time       137 ns <

 Memory estimate: 96 bytes, allocs estimate: 1.

Whereas preallocating a vector before hand yielded

julia> function f1(df)
       df.G = Vector{Int}(undef,4)
       df.G = df.X+df.Y
       end

f1 (generic function with 1 method)

julia> @benchmark f1(df3)
BenchmarkTools.Trial: 10000 samples with 919 evaluations.
 Range (min … max):  109.675 ns …   2.028 ΞΌs  β”Š GC (min … max): 0.00% … 92.06%
 Time  (median):     111.353 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   124.812 ns Β± 108.725 ns  β”Š GC (mean Β± Οƒ):  5.75% Β±  6.19%

  β–ˆβ–…β–ƒβ–β–ƒβ–„β–„β–„β–ƒβ–ƒβ–‚β–‚β–β–β–β–β–β–β–β–β–  ▁                                      ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–‡β–‡β–ˆβ–‡β–‡β–†β–†β–†β–†β–†β–†β–„β–†β–†β–†β–…β–„β–…β–ƒβ–„β–…β–‚β–„β–„β–…β–…β–ƒβ–… β–ˆ
  110 ns        Histogram: log(frequency) by time        178 ns <

 Memory estimate: 224 bytes, allocs estimate: 3.

and using transform seems to be even slower

@benchmark @transform(df3, :Q = :X .+:Y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.625 ΞΌs …  4.450 ms  β”Š GC (min … max): 0.00% … 99.20%
 Time  (median):     12.875 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   14.912 ΞΌs Β± 61.162 ΞΌs  β”Š GC (mean Β± Οƒ):  5.75% Β±  1.40%

  β–„β–†β–‡β–ˆβ–‡β–‡β–†β–…β–„β–„β–ƒβ–ƒβ–‚β–‚β–‚β–ƒβ–‚β–‚β–‚β–‚β–β–β–  ▁▁▂▁▁▁     ▁▁                      β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–ˆβ–‡β–‡β–‡β–‡β–‡β–†β–†β–†β–†β–†β–†β–…β–†β–†β–† β–ˆ
  11.6 ΞΌs      Histogram: log(frequency) by time      26.1 ΞΌs <

 Memory estimate: 9.20 KiB, allocs estimate: 168.

Any tips on what I’m doing wrong would be greatly appreciated thanks!

I’m a little bit confused - your question is about adding columns to a SubDataFrame, but your benchmarking code is all about adding columns to a regular DataFrame. Are you concerned about a slowdown when adding columns to a SubDataFrame compared to working with a DataFrame or about the speed of adding a column to a DataFrame in general?

Also your first attempt runs in 84 nano(!)seconds and has one allocation, do you think that this is somehow β€œtoo slow”? If so, what would you expect? Preallocation is a useful strategy when you can re-use memory in a calculation, but in this case you want to create a new vector, so you will have to make at least one allocation.

1 Like

A standard way to write it would be:

transform!(gdf, :col1 => function1 => :newcol1, :col2 => function2 => :newcol2)

and this should be efficient.

Of course this assumes that your data frame is large enough. For very small data frames (as in your example) compilation and bookkeeping will be much more expensive than computations themselves.

1 Like

Thanks so much for pointing this out! I was concerned about the speed of adding columns to a SubDataFrame but I used the DataFrame as a bench mark because the Group Data Frame I am working with is quite large and I assumed that the most efficient method to add columns to a DataFrame would apply equally to a Sub Data Frame. Is this an incorrect assumption? Also thanks for explaining the performance with respect to pre-allocation. Going over the documentation makes a lot more sense now. So is it safe to say that it is not the pre-allocation that improves performance but just preventing of re-allocating memory that improves performance?

Thanks so much BogumiΕ‚ ! Sorry I’m still having a little difficulty understanding the pair notation even though I’ve seen it used quite often. Is it the case that for some column x in a DataFrame df the pairing operator => here runs the function on the column such that

:x => function === function(df.x)

whereas in the second instance of the pairing operator, it is acting as an assignment for :newcol? what happens if the function involves more than one column? Also thanks for clarifying the discrepancy in the performance of transform!

1 Like

The operator works as follows:

[:source_column] => [function] => [:target_column]

which is (simlifying a bit but I understand you want a mental model) the same as:

df.target_column = function(df.source_column)

So as you can see the => operator shows the ETL data flow (if you happened to use data bases):

  • Extract a column
  • Transform data
  • Load the result into a data frame

Function can involve both many input columns and many output columns. Let me give you a concrete example:

julia> df = DataFrame(a=[1, 2, 3, 4], b=[4, 3, 2, 1])
4Γ—2 DataFrame
 Row β”‚ a      b
     β”‚ Int64  Int64
─────┼──────────────
   1 β”‚     1      4
   2 β”‚     2      3
   3 β”‚     3      2
   4 β”‚     4      1

julia> f(a, b) = extrema.(zip(a, b))
f (generic function with 1 method)

julia> f(df.a, df.b)
4-element Vector{Tuple{Int64, Int64}}:
 (1, 4)
 (2, 3)
 (2, 3)
 (1, 4)

julia> transform!(df, [:a, :b] => f => [:min, :max])
4Γ—4 DataFrame
 Row β”‚ a      b      min    max
     β”‚ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 β”‚     1      4      1      4
   2 β”‚     2      3      2      3
   3 β”‚     3      2      2      3
   4 β”‚     4      1      1      4

As you can see the β€œextract” part can fetch multiple columns and the β€œload part” can store back multiple columns. The details here are more complex (and are related to AsTable wrapper). I recommend you to either read the documentation for explanation or my DataFrames.jl minilanguage explained | Blog by BogumiΕ‚ KamiΕ„ski post.

1 Like

Thanks the explanation and blog post is really helpful! looking forward to the hardcopy of your book in December!