Output to multiple target columns via transform in DataFrames Minilanguage

phantom · December 14, 2022, 6:12am

Suppose I have some function fun that takes in multiple columns of a GroupedDataFrame and outputs a series of calculations with additional parameters. I would like to output these calculations to new columns of the GroupedDataFrame via transform so that it would look something like this.

transform!(GDF, [:A, :B] => (a, b) -> fun(a, b, p1, p2) => [:NewA, :NewB, :NewAB])

What type of output does fun need to return so that the calculations may output to the newly named columns?

After reading bkamins's very informative post on the DataFrames minilanguage (https://bkamins.github.io/julialang/2020/12/24/minilanguage.html), I tried having the function return a tuple

function fun(A, B, p1, p2)
    ...
    return (NewA, NewB, NewAB)
end

named tuple

   return NewCols = (A = NewA , B = NewB, AB = NewAB)

Vector

    return [NewA, NewB, NewAB]

However, the output always gets bunched together into a single auto-named column once the function is used in transform!. When I tried returning a DataFrame

   return NewCols = DataFrame(A = NewA , B = NewB, AB = NewAB)

The entire DataFrame was written to the newly generated column, with each row of that column holding a copy of the entire DataFrame.

I am not sure whether my mistake is in the format of what fun returns or in how I have written the minilanguage expression. Any pointers on where I am going wrong would be greatly appreciated! Thanks!

Jollywatt · December 14, 2022, 8:07am

The DataFrames minilanguage provides AsTable for this purpose (see the Multiple Target Columns section again).

A concrete example:

using RDatasets
using DataFrames

gdf = groupby(dataset("datasets", "iris"), :Species)

function fun(A, B)
	return (SepalMin = min.(A, B), SepalMax = max.(A, B))
end

transform(gdf, [:SepalLength, :SepalWidth] => fun => AsTable)

Alternatively, you can use ByRow on the function and specify multiple output columns. In this case, the function acts on each row separately (it does not need to be vectorized) and can return any iterable (here, a plain tuple).

transform(gdf,
    [:SepalLength, :SepalWidth] =>
    ByRow((a, b) -> (min(a, b), max(a, b))) =>
    [:SepalMin, :SepalMax]
)

phantom · December 14, 2022, 10:54am

Thanks! Yes I tried using AsTable at first with the following format.

transform(GDF, [:A, :B] => (x,y) -> fun(x, y, p1,p2) => AsTable)

But, as with the named tuple output, transform ended up grouping the output into a single column.

My understanding from the post is that output that is an AbstractDataFrame, DataFrameRow, AbstractMatrix or NamedTuple should go to multiple target columns. So if my function returns one of those, shouldn’t I be able to output to multiple columns via transform?

Thanks also for the ByRow option. I’ll look into it further. The issue is that the function uses input from more than one row of a given column when calculating the output, so having the function return the entire column seems like it might be more efficient than iterating over the DataFrame with ByRow?
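
For a function that needs whole columns within each group rather than one row at a time, a minimal sketch along the following lines may help. The data, the helper name running, and the output column names are made up for illustration. The function receives each group's columns as vectors and returns a NamedTuple of vectors, which AsTable spreads into separate columns.

```julia
using DataFrames

# Illustrative data (not from the thread): two groups of two rows each.
df = DataFrame(g = [1, 1, 2, 2], A = [1, 2, 3, 4], B = [10, 20, 30, 40])
gdf = groupby(df, :g)

# Hypothetical helper: works on whole per-group columns, so each output
# row can depend on other rows of the same group.
running(a, b) = (CumA = cumsum(a), BFromFirst = b .- first(b))

# The NamedTuple of vectors is expanded into the columns CumA and BFromFirst.
out = transform(gdf, [:A, :B] => running => AsTable)
```

Because the function is called once per group instead of once per row, nothing is iterated row by row here.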

bkamins · December 14, 2022, 11:15am

Let me write what I understand you want. But if you want something else please comment:

julia> using DataFrames

julia> gdf = groupby(DataFrame(id=1:3, A=11:13, B=101:103), :id)
GroupedDataFrame with 3 groups based on key: id
First Group (1 row): id = 1
 Row │ id     A      B
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11    101
⋮
Last Group (1 row): id = 3
 Row │ id     A      B
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     3     13    103

julia> p1, p2 = 10, 1000  # parameter values implied by the output shown below
(10, 1000)

julia> fun(a, b, p1, p2) = (x=a*p1, y=b*p2, z=(a+b)*p1*p2)
fun (generic function with 1 method)

julia> transform(gdf, [:A, :B ] => ((a, b ) -> fun(a,b, p1,p2)) => AsTable)
3×6 DataFrame
 Row │ id     A      B      x      y       z
     │ Int64  Int64  Int64  Int64  Int64   Int64
─────┼─────────────────────────────────────────────
   1 │     1     11    101    110  101000  1120000
   2 │     2     12    102    120  102000  1140000
   3 │     3     13    103    130  103000  1160000

julia> transform(gdf, [:A, :B ] => ((a, b ) -> fun(a,b, p1,p2)) => [:NewA, :NewB, :NewAB])
3×6 DataFrame
 Row │ id     A      B      NewA   NewB    NewAB
     │ Int64  Int64  Int64  Int64  Int64   Int64
─────┼─────────────────────────────────────────────
   1 │     1     11    101    110  101000  1120000
   2 │     2     12    102    120  102000  1140000
   3 │     3     13    103    130  103000  1160000

And a second example with a matrix:

julia> fun2(a, b, p1, p2) = [a*p1 b*p2 (a+b)*p1*p2]
fun2 (generic function with 1 method)

julia> transform(gdf, [:A, :B ] => ((a, b ) -> fun2(a,b, p1,p2)) => [:NewA, :NewB, :NewAB])
3×6 DataFrame
 Row │ id     A      B      NewA   NewB    NewAB
     │ Int64  Int64  Int64  Int64  Int64   Int64
─────┼─────────────────────────────────────────────
   1 │     1     11    101    110  101000  1120000
   2 │     2     12    102    120  102000  1140000
   3 │     3     13    103    130  103000  1160000

julia> transform(gdf, [:A, :B ] => ((a, b ) -> fun2(a,b, p1,p2)) => AsTable)
3×6 DataFrame
 Row │ id     A      B      x1     x2      x3
     │ Int64  Int64  Int64  Int64  Int64   Int64
─────┼─────────────────────────────────────────────
   1 │     1     11    101    110  101000  1120000
   2 │     2     12    102    120  102000  1140000
   3 │     3     13    103    130  103000  1160000

phantom · December 14, 2022, 2:10pm

Thanks so much for taking the time to make this detailed example! It was instrumental in illustrating my mistake. I thought the issue was that transform wouldn’t work if an array was passed as a parameter into fun without some modification to the output type but it all works fine now with the template you provided.

On the off chance it helps other novices: I had incorrectly used

transform(gdf, [:A, :B ] => (a, b ) -> fun(a,b, p1,p2) => [:NewA, :NewB, :NewAB])

as opposed to

transform(gdf, [:A, :B ] => ((a, b ) -> fun(a,b, p1,p2)) => [:NewA, :NewB, :NewAB])

Missing the parentheses around (a, b) -> fun(a, b, p1, p2) put the output of fun into a single column of the GroupedDataFrame.

bkamins · December 14, 2022, 3:53pm

Yes - the ( and ) are needed because of operator precedence rules in Julia.
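
The effect of the precedence rule can be checked in plain Julia, without DataFrames. The following is a small sketch with made-up names (sel_unparenthesized, sel_parenthesized) and a toy function body. Without parentheses the body of the anonymous function swallows the trailing => target, so the function itself returns a Pair. With parentheses the function and the target stay separate.

```julia
# Without parentheses: everything to the right of -> is the function body.
sel_unparenthesized = [:A, :B] => (a, b) -> a + b => :C

# With parentheses: source => (function => target), the shape the
# minilanguage expects.
sel_parenthesized = [:A, :B] => ((a, b) -> a + b) => :C

# In the first case the second element is a function returning a Pair.
println(sel_unparenthesized.second isa Function)
println(sel_unparenthesized.second(1, 2))

# In the second case the second element is a function => target Pair.
println(sel_parenthesized.second isa Pair)
println(sel_parenthesized.second.second)
```

In the unparenthesized form DataFrames only sees source => function, which would explain why the result ends up in one auto-named column.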

phantom · December 14, 2022, 4:15pm

got it! appreciate the clarification.

rocco_sprmnt21 · December 14, 2022, 4:42pm

This is a confusion that very often occurs even to those who know the rule.
I wonder if there isn’t a way to change the precedence rules locally to the minilanguage (still using the same “->” symbols for the function and “=>” for the pairs).
Obviously in that case it would be necessary to use the parentheses in case you want the pairs as output of the function. But being an explicit choice it would be less prone to “distractions”.

bkamins · December 14, 2022, 10:27pm

I do not think it is technically possible unfortunately.

adienes · December 14, 2022, 10:44pm

The precedence of -> is surprisingly low sometimes. This is also an issue when combining it with |>.

rocco_sprmnt21 · December 15, 2022, 8:12am

In the global context of Julia, what are the advantages (and therefore the contraindications to any changes) of the current rules of precedence?

rocco_sprmnt21 · December 15, 2022, 8:14am

since we have “->” with lower precedence than “=>”, what do you think of a super function indicated with “+>” :), to use in case you want a different order from the standard?

nilshg · December 15, 2022, 9:12am

Irrespective of the merits, this would be a pretty massively breaking change to the parser affecting loads of Julia code (whether in the context of DataFrames or not) and so won’t happen for 1.0

Also a key design goal for DataFrames is consistency with base Julia, so having different operator precedence context dependent within DataFrames runs counter to that. This is the domain of macros, where the @ clearly signals “this bit of code does not mean what you might think it means!”

rocco_sprmnt21 · December 15, 2022, 10:34am

ok.
What I can’t evaluate are the reasons WHY “=>” has higher precedence than “->” and therefore, in addition to compatibility problems with already written code, what the contraindications to changing the order of precedence would be.
In the context of the minilanguage I would say that in the vast majority of cases (99%?) the opposite of the current order would be preferable.

I would also say that in the case of the minilanguage no solution (whether macro or otherwise) that requires explicit user intervention would be really useful, because once the user realizes the precedence problem, they just enclose the function in parentheses “(” and “)”.

bkamins · December 15, 2022, 11:45am

In general this was discussed some time ago, but AFAICT:

=> has to have lower precedence than e.g. =
= has to have lower precedence than ->

A complete example involving all (artificial but showing the issue) is:

julia> map(x -> x = 10 => 20, 1:2)
2-element Vector{Pair{Int64, Int64}}:
 10 => 20
 10 => 20

where you clearly do not want this to be parsed as:

julia> map(((x -> x) = 10) => 20, 1:2)
ERROR: syntax: invalid assignment location "x -> begin

rocco_sprmnt21 · December 17, 2022, 9:11pm

So the fact that => has lower(*) precedence than -> comes from the transitivity of precedence ordering?

(*)
In my own language I would have used the term “higher precedence” in these cases, if I haven’t misunderstood something.

bkamins · December 17, 2022, 10:21pm

I mixed up “lower” with “higher” in my comment - you are right.

aplavin · December 18, 2022, 8:16am

Don’t think transitivity is required here. I (and likely many others) definitely prefer x -> first(x) => length(x) to parse as a function, as it does now.

rocco_sprmnt21 · December 18, 2022, 10:54am

I fail to assess the practical relevance of such an example.
I would expect more of a situation like this

[first(x) => length(x) for x in vecofvec]

aplavin · December 18, 2022, 12:23pm

first and length are just arbitrary examples, no specific meaning attached here.

It’s just that functional forms are more easily composable than comprehensions, so I’m basically talking about stuff like map(x -> x.a => x.b, X). Pretty useful when creating dictionaries, among other things.
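
As a small illustration of that pattern (the collection X and its fields a and b are made up here), the current precedence lets the anonymous function return a Pair without extra parentheses, and the resulting vector of pairs can be passed straight to Dict:

```julia
# Hypothetical records with fields a and b.
X = [(a = 1, b = "one"), (a = 2, b = "two")]

# x -> x.a => x.b parses as x -> (x.a => x.b), a function returning a Pair.
pairs_vec = map(x -> x.a => x.b, X)

# A vector of Pairs converts directly into a dictionary.
d = Dict(pairs_vec)
println(d[2])
```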

And what are the common use cases where you’d like the opposite precedence order? Aside from “minilanguages” that serve specific niches: it’s not possible to cater to all possible DSLs in base Julia syntax anyway, and DSLs are free to use macros for exactly this purpose.
