Apply a column of anonymous functions for each column in a column subset

babaq · April 12, 2022, 7:37pm

Hi,

Suppose we have a DataFrame like this:

df = DataFrame(name='a':'c',x1=1:3,x2=[[1,2],[3,4],[5,6]],xfun=[x->x.-1,x->x.^2,x->x.^3])

I want to transform columns x1 and x2 in df by applying each function in xfun to the corresponding row of x1 and x2. I could use [:x1,:xfun]=>ByRow((x,f)->f(x))=>:x1, but what if there are 20 such columns? Is there a more elegant way to achieve this?

The other way I can think of is to convert the columns to a Matrix and broadcast a vector of anonymous functions along the first dimension of the Matrix, but I don’t know whether there is a generic apply function to broadcast.

Thanks,
Alex

bkamins · April 12, 2022, 8:42pm

julia> combine(df, vcat.(["x1", "x2"], "xfun") .=> ByRow((x,f) -> f(x)) => first)
3×2 DataFrame
 Row │ x1     x2
     │ Int64  Array…
─────┼───────────────────
   1 │     0  [0, 1]
   2 │     4  [9, 16]
   3 │    27  [125, 216]

Instead of ["x1", "x2"], provide an expression that generates the column names you want to include.
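
To unpack the one-liner above, here is a small sketch, assuming DataFrames.jl 1.x. The regex selector `r"^x\d+$"` is only an illustrative way to generate the column names and is not part of the original answer.

```julia
using DataFrames

df = DataFrame(name='a':'c', x1=1:3, x2=[[1,2],[3,4],[5,6]],
               xfun=[x -> x .- 1, x -> x .^ 2, x -> x .^ 3])

# Illustrative selector (an assumption): every column named "x" plus digits.
cols = names(df, r"^x\d+$")

# Broadcasting vcat pairs each data column with the function column.
sources = vcat.(cols, "xfun")

# `=>` is right-associative, so the inner pair is built once and then
# broadcast across `sources`; the parentheses only make that explicit.
# `first` picks the first source name ("x1" or "x2") as the output name.
specs = sources .=> (ByRow((x, f) -> f(x)) => first)

result = combine(df, specs)
```

Each element of `specs` is an ordinary `source => function => target` specification, so the same vector can be passed to `transform` as well.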

babaq · April 12, 2022, 11:37pm

Thanks, it’s exactly what I want. The transform version works too, like this:

transform(df, vcat.(["x1", "x2"], "xfun") .=> ByRow((x,f) -> f(x)) => first)

Is there any other difference between these two?

bkamins · April 13, 2022, 7:28am

The differences are:

transform keeps all source columns always; combine only keeps columns specified in transformations;
transform requires output to have as many rows as input; combine allows any number of rows in output.

Other than that, these functions interpret transformation specifications in the same way (i.e. the same engine processes both requests, but with different additional constraints).
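
A minimal sketch of both differences, assuming DataFrames.jl 1.x; the column names `doubled`, `total` and `short` are made up for illustration.

```julia
using DataFrames

df = DataFrame(name='a':'c', x1=1:3)

# Difference 1: which columns survive.
kept_by_combine   = names(combine(df, :x1 => ByRow(x -> 2x) => :doubled))
kept_by_transform = names(transform(df, :x1 => ByRow(x -> 2x) => :doubled))

# Difference 2: combine may change the number of rows ...
total = combine(df, :x1 => sum => :total)

# ... while transform rejects a result whose length differs from the input.
transform_rejected = try
    transform(df, :x1 => (x -> x[1:2]) => :short)
    false
catch
    true
end
```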

rocco_sprmnt21 · April 13, 2022, 3:57pm

Just a slightly different way of combining things:

cols=["x1", "x2"]
combine(df, ["xfun";cols]=>ByRow((f,x...)->f.(x))=>cols)

But above all, I wanted to ask about the use of the first function in the output position instead of a list of column names/symbols.

PS

I wonder if and when it will also be possible to write something like this:

combine(df, [cols;"xfun"]=>ByRow((x...,f)->f.(x))=>cols)

# so for the given df it would be possible to save some typing :-)

combine(df, 2:4=>ByRow((x...,f)->f.(x))=>2:3)

babaq · April 13, 2022, 10:41pm

Thanks for the clarification!

babaq · April 13, 2022, 10:43pm

splitting to a vector of names is also quite concise.

bkamins · April 14, 2022, 6:51am

Base Julia does not allow this and I do not think it will be allowed.
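
The restriction is that a slurping argument such as `x...` must be the last one in a method signature. One possible workaround, sketched here under the assumption of DataFrames.jl 1.x, is to slurp everything and split off the last argument inside the function. The helper name `apply_last` is hypothetical.

```julia
using DataFrames

df = DataFrame(name='a':'c', x1=1:3, x2=[[1,2],[3,4],[5,6]],
               xfun=[x -> x .- 1, x -> x .^ 2, x -> x .^ 3])
cols = ["x1", "x2"]

# Hypothetical helper: treat the LAST argument as the function and
# apply it to each of the preceding arguments, returning a tuple.
apply_last(args...) = map(args[end], args[1:end-1])

# The function column can now stay last in the source list.
result = combine(df, [cols; "xfun"] => ByRow(apply_last) => cols)
```

Putting the function column first, as in `["xfun"; cols] => ByRow((f, x...) -> f.(x))`, avoids the helper entirely.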

rocco_sprmnt21 · April 14, 2022, 11:48am

I take this opportunity to ask you a further, more specific question about the mini-language.
If I understand correctly, some input forms, such as column ranges, are not allowed in the output position.
For example, 2:3 => fun => 2:3 doesn’t work.
If so, what is the reason for these restrictions?

bkamins · April 14, 2022, 11:58am

This could work and would mean the following:

pass contents of columns 2 and 3 as positional arguments to function fun and expand the result returned by it into two columns whose names are taken as names of columns 2 and 3 from the source

The first question is if this is what you would expect. If this is what you would expect, at least for me this is a very specific case that is needed quite rarely and currently it can be expressed as 2:3 => fun => names(df, 2:3) which is only a bit more verbose.

For single column transformations like 2 => fun => 2 in your proposed notation, which are more common, either pass renamecols=false as kwarg and write just 2 => fun or write 2 => fun => identity to retain source column name. This does not cover the case like 2 => fun => 3, but again I think that it is quite rare.
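
The alternatives mentioned above can be sketched as follows, assuming DataFrames.jl 1.x. The function `square` is a made-up example, and `fun` mirrors the two-output function used elsewhere in this thread.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)

square(x) = x .^ 2                                  # illustrative only
fun(x, y) = map((p, q) -> (p + q, p - q), x, y)     # two outputs per row

# Single column, keep the source name: two equivalent spellings.
r1 = combine(df, 2 => square; renamecols=false)
r2 = combine(df, 2 => square => identity)

# Column range as input, names of the same range as output.
r3 = combine(df, 1:2 => fun => names(df, 1:2))
```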

What is your use case where you require this kind of transformations?

rocco_sprmnt21 · April 14, 2022, 12:43pm

The simple one: the first.
Obviously, when I ran the test I mixed up something else.
Thanks

bkamins · April 14, 2022, 1:02pm

For reference, here is an example where your original syntax could be useful:

julia> using DataFrames

julia> fun(x, y) = map((a, b) -> (a+b, a-b), x, y)
fun (generic function with 1 method)

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> combine(df, [:a, :b] => fun => [:a, :b])
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     5     -3
   2 │     7     -3
   3 │     9     -3
