DataFrames transform!

Hello!

What is the right way to transform a data frame in place with a function that returns multiple columns?
I want to write it like
transform!(df, :A=>(x->f(x))=>[:B,:C])
but it doesn't work.

Edit: If the function returns a vector of vectors, it is interpreted as a vector of rows, not columns. One solution is to return a matrix instead. Note the space instead of a comma in x->[x.+1 x.^2]:

julia> df
3×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> transform!(df, :x=>(x->[x.+1 x.^2]) => [:y, :z])
3×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      1
   2 │     2      3      4
   3 │     3      4      9
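The difference between the space and the comma can be checked directly. The sketch below uses only the `df` from the example above; the helper names `m` and `v` are made up for illustration.

```julia
using DataFrames

df = DataFrame(x = 1:3)
x = df.x

# Space-separated entries are concatenated horizontally into a matrix
m = [x .+ 1 x .^ 2]
# Comma-separated entries give a vector whose elements are vectors
v = [x .+ 1, x .^ 2]

println(size(m))    # dimensions of the matrix: one row per row of df
println(length(v))  # number of inner vectors

# The matrix form supplies one matrix column per target column name
transform!(df, :x => (x -> [x .+ 1 x .^ 2]) => [:y, :z])
```

The matrix has one row per row of `df`, so each of its columns maps onto one new column.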

The one proposed by @skleinbo is probably the most typical way, but there are other ways:


    transform!(df, :x=>ByRow(r->(y=r+1, z=r^2))=>AsTable)

    hcat(df,DataFrame(y=df.x.+1, z=df.x.^2))
    
    insertcols!(df,2, :y=>df.x.+1, :z=>df.x.^2)
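One difference between these alternatives is worth spelling out: `hcat` returns a new data frame, while `transform!` and `insertcols!` modify `df` itself. A minimal sketch, where `df2` and `n_before` are names chosen only for this illustration:

```julia
using DataFrames

df = DataFrame(x = 1:3)

# hcat builds a new data frame; df itself is left untouched
df2 = hcat(df, DataFrame(y = df.x .+ 1, z = df.x .^ 2))
n_before = ncol(df)   # df still has only the column x here

# insertcols! mutates df, placing the new columns starting at position 2
insertcols!(df, 2, :y => df.x .+ 1, :z => df.x .^ 2)

println(names(df2))
println(names(df))
```

So if the goal is an in-place update, `hcat` needs its result assigned back to a variable, whereas the other two do not.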

The function can return a “vector of vectors”, but they are interpreted as rows:

julia> df = DataFrame(A=1:3)
3×1 DataFrame
 Row │ A
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> transform!(df, :A => (x -> [[v+1, v+2] for v in x]) => [:B, :C])
3×3 DataFrame
 Row │ A      B      C
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3
   2 │     2      3      4
   3 │     3      4      5

The general format of expected output with multiple output columns is:

  1. If the function returns one of AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix, then the output columns are taken from its columns.
  2. If the function returns an AbstractVector, then each element of this vector must support the keys function, which must return a collection of Symbols, strings, or integers; the return value of keys must be identical for all elements. As many columns are created as there are elements in the return value of keys.
  3. If the function returns a value of any other type, it is assumed to be a table conforming to the Tables.jl API, and Tables.columntable is called on it.
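The first two rules can be illustrated with a short sketch. It only uses return shapes that appear elsewhere in this thread; the column names `B` through `G` are arbitrary.

```julia
using DataFrames

df = DataFrame(A = 1:3)

# Rule 1: a NamedTuple of vectors; each field becomes a column
transform!(df, :A => (x -> (B = x .+ 1, C = x .+ 2)) => AsTable)

# Rule 2: a vector of NamedTuples; each element is one row,
# and the shared keys (:D, :E) become the column names
transform!(df, :A => (x -> [(D = v * 2, E = v * 3) for v in x]) => AsTable)

# Rule 2: a vector of plain tuples; the keys are the integers 1 and 2,
# so the column names have to be given explicitly
transform!(df, :A => (x -> [(v - 1, v - 2) for v in x]) => [:F, :G])

println(df)
```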

I take this opportunity to ask for some more details on the choices made regarding the possible outputs of “fun”.

# this

transform!(df, :x => (x -> [(v+1, v+2) for v in x]) => [:B, :C])

# is equivalent to this

transform!(df, :x => (x -> [(B=v+1, C=v+2) for v in x]) => AsTable)

My question is why this (array of named tuples) works

transform!(df, :x=>ByRow(r->(y=r+1, z=r^2))=>AsTable)

and this (named tuple of arrays) does not

transform!(df, :x=>r->(y = r.+1, z = r.^2)=>AsTable)

It works; you just forgot the parentheses around the anonymous function (without them, => AsTable becomes part of the function body):

julia> df = DataFrame(x=1:3)
3×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> transform!(df, :x => (r->(y = r.+1, z = r.^2)) => AsTable)
3×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      1
   2 │     2      3      4
   3 │     3      4      9
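To see what the parser does with and without the parentheses, the two pairs can be inspected on their own, without calling `transform!`. This is a small sketch; `p` and `q` are names chosen for illustration.

```julia
using DataFrames

# No parentheses: the body of the anonymous function extends to the end,
# so `=> AsTable` is part of what the function returns
p = :x => r -> (y = r .+ 1, z = r .^ 2) => AsTable

println(p.second isa Function)      # the second element is the whole lambda
println(p.second([1, 2]) isa Pair)  # calling it returns `namedtuple => AsTable`

# With parentheses: the function ends at the closing parenthesis
q = :x => (r -> (y = r .+ 1, z = r .^ 2)) => AsTable

println(q.second isa Pair)          # `function => AsTable`
```

In the first form `transform!` receives `:x => function`, with no target specification at all, which is why the call does not do what was intended.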

I tried to compare the various ways, but they all seem equivalent.
I couldn't measure how insertcols! performs, because @btime fails on the second pass: the columns with the same names already exist by then.
Was an option for insertcols! to overwrite an existing column ever considered?
If so, why was it rejected?

Use the setup code of @btime to create a fresh data frame for each run.
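A minimal sketch of what this could look like, assuming BenchmarkTools.jl is installed. The variable `d` and the size of 1000 rows are arbitrary choices for illustration.

```julia
using BenchmarkTools, DataFrames

# setup runs before every sample, and evals=1 makes each sample a single call,
# so insertcols! always receives a data frame without the :y and :z columns
@btime insertcols!(d, 2, :y => d.x .+ 1, :z => d.x .^ 2) setup=(d = DataFrame(x = 1:1000)) evals=1
```

Without `evals=1` a sample may call the expression several times on the same `d`, and the second call would hit the duplicate-name error again.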

That is the point of insertcols!: it should error in this case. If you want to overwrite an existing column, use setindex! or setproperty! (i.e. just write df.col = vector or df[!, col] = vector).
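For example, both forms below replace an existing column `y` (a sketch with a made-up data frame):

```julia
using DataFrames

df = DataFrame(x = 1:3, y = [0, 0, 0])

# setproperty!: replaces column y with the new vector
df.y = df.x .+ 1

# setindex! with `!` as the row selector: replaces column y again
df[!, :y] = df.x .^ 2

println(df)
```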

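Or pass makeunique=true to insertcols!. Note that this does not overwrite the existing column; the new column is added under a deduplicated name such as y_1.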
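A short sketch of how `makeunique=true` behaves, using a made-up data frame that already has a column `y`:

```julia
using DataFrames

df = DataFrame(x = 1:3, y = [0, 0, 0])

# The existing y is kept; the inserted column gets a suffix to stay unique
insertcols!(df, :y => df.x .+ 1; makeunique = true)

println(names(df))
```

The original `y` keeps its values, so this is useful for benchmarking repeated inserts but not as a way to replace a column.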