Efficient way to add column to dataframe computed from prior columns

I was struggling to find a clean way to add a named column to an existing dataframe, computed from some of its existing columns. While drafting this question I stumbled upon the official docs' description of combine/transform, and I am quite happy with using transform as posted below. Still, I am wondering whether there are other ways to write this operation that are clean and possibly more performant (CPU usage/memory).

Also, does anybody have a good list of nice/terse/clean Julia formulas for creating and transforming dataframes? What I am really looking for is a collection of snippets that I can try out in the Julia REPL to get a better understanding of the syntax.

julia> df = DataFrame(X = [1, 2, 3, 4], Y = [0, 1, 2, 4])
4×2 DataFrame
 Row │ X      Y
     │ Int64  Int64
─────┼──────────────
   1 │     1      0
   2 │     2      1
   3 │     3      2
   4 │     4      4

julia> add = (a, b) -> a + b
#7 (generic function with 1 method)

julia> transform!(df, :, [:X, :Y] => add => :Z)
4×3 DataFrame
 Row │ X      Y      Z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      0      1
   2 │     2      1      3
   3 │     3      2      5
   4 │     4      4      8

You can do df.Z = df.X + df.Y. For more complicated operations, I like Chain.jl plus either DataFrameMacros.jl or DataFramesMeta.jl (both provide similar tools, but the former operates by-row while the latter operates on columns by default). For example,

julia> using DataFrameMacros, Chain

julia> @chain df begin
           @transform!(:Z = add(:X, :Y))
       end
4×3 DataFrame
 Row │ X      Y      Z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      0      1
   2 │     2      1      3
   3 │     3      2      5
   4 │     4      4      8

If you don’t need to chain multiple operations together, you can omit @chain:

julia> @transform!(df, :Z = add(:X, :Y))
4×3 DataFrame
 Row │ X      Y      Z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      0      1
   2 │     2      1      3
   3 │     3      2      5
   4 │     4      4      8

I was hoping there was a way to express this in a more declarative fashion like you have done!

Note, DataFramesMeta.jl now exports the macros @rtransform, @rselect, @rsubset, and @rorderby. So it now has feature parity for row-wise operations (with the addition of the letter r).

using DataFramesMeta # also exports Chain.jl
@chain df begin 
    @rtransform! :Z = :X + :Y
end
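
To make the column-wise versus row-wise distinction concrete, here is a small sketch using the non-mutating DataFramesMeta.jl macros. The names by_col and by_row are made up for the example.

```julia
using DataFrames, DataFramesMeta

df = DataFrame(X = [1, 2, 3, 4], Y = [0, 1, 2, 4])

# Column-wise: :X and :Y are whole vectors, so the addition is broadcast.
by_col = @transform(df, :Z = :X .+ :Y)

# Row-wise: :X and :Y are scalars from a single row, so plain + is enough.
by_row = @rtransform(df, :Z = :X + :Y)
```

Both calls return a new DataFrame and leave df untouched.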

Personally, I prefer the non-mutating transform over the mutating transform!, and the base DataFrames library over macros. Both choices simplify my code.
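
A minimal sketch of the difference, assuming the same df as in the question (df_new and cols_after_copying_call are illustrative names):

```julia
using DataFrames

df = DataFrame(X = [1, 2, 3, 4], Y = [0, 1, 2, 4])

# Non-mutating: returns a new DataFrame; `df` keeps its two columns.
df_new = transform(df, [:X, :Y] => (+) => :Z)
cols_after_copying_call = names(df)

# Mutating: adds :Z to `df` itself and returns it.
transform!(df, [:X, :Y] => (+) => :Z)
```

With the non-mutating form, every intermediate result has its own name, which is what makes longer pipelines easier for me to reason about.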

And I’m another +1 for Chain.jl.

It might be nice to mention the ‘declarative’ form more prominently in the docs. As far as I can tell, it’s mostly described here as a way to create a DataFrame from scratch. The docstring for transform! is ~1600 words long and applies to four distinct methods, includes many conditional clauses, and has no examples, which may be a bit intimidating for a newcomer. It may be worth including an example in the transform! docs that uses both df.Z = df.X + df.Y and transform!(df, :, [:X, :Y] => + => :Z) to intercept users before they get buried in contemplation of which of the seven allowable forms of args... to apply.

This video might be relevant: https://www.youtube.com/watch?v=rDvpLFxcL84

But I think DataFrameMacros.jl and DataFramesMeta.jl are both good for what you want.

I think it’s a good decision for DataFrames.jl to remain macro-less, because then macro convenience packages can build various APIs for people to use.
