Output to multiple target columns via transform in DataFrames Minilanguage

Suppose I have some function fun that takes in multiple columns of a GroupedDataFrame and outputs a series of calculations with additional parameters. I would like to output these calculations to new columns of the GroupedDataFrame via transform so that it would look something like this.

transform!(GDF, [:A, :B ] => (a, b ) -> fun(a,b, p1,p2) => [:NewA, :NewB, :NewAB])

What type of output does fun need to return so that the calculations may output to the newly named columns?

After reading bkamins' very informative post on the DataFrames minilanguage (https://bkamins.github.io/julialang/2020/12/24/minilanguage.html), I tried having the function return a tuple

function fun(A,B, p1,p2)
    ...
    return (NewA, NewB, NewAB)

named tuple

   return NewCols = (A = NewA , B = NewB, AB = NewAB)

Vector

    return [NewA, NewB, NewAB]

However, once fun is used in transform!, the output always gets bunched together into a single auto-named column. When I tried returning a DataFrame

   return NewCols = DataFrame(A = NewA , B = NewB, AB = NewAB)

The entire DataFrame was written to the newly generated column, and each row of that column held a copy of the entire DataFrame.

I am not sure whether my mistake is in the format of what fun returns or in how I have written the minilanguage expression. Any pointers on where I am going wrong would be greatly appreciated! Thanks!

The DataFrames minilanguage provides AsTable for this purpose (see the Multiple Target Columns section again).

A concrete example:

using RDatasets
using DataFrames

gdf = groupby(dataset("datasets", "iris"), :Species)

function fun(A, B)
	return (SepalMin = min.(A, B), SepalMax = max.(A, B))
end

transform(gdf, [:SepalLength, :SepalWidth] => fun => AsTable)

Alternatively, you can use ByRow on the function and specify multiple output columns. In this case, the function acts on each row separately (so it does not need to be vectorized) and can return any iterable (here, a plain tuple).

transform(gdf,
    [:SepalLength, :SepalWidth] =>
    ByRow((a, b) -> (min(a, b), max(a, b))) =>
    [:SepalMin, :SepalMax]
)

Thanks! Yes I tried using AsTable at first with the following format.

transform(GDF, [:A, :B] => (x,y) -> fun(x, y, p1,p2) => AsTable)  

But, as with the named tuple output, transform ended up grouping the output into a single column.

My understanding from the post is that output that is an AbstractDataFrame, DataFrameRow, AbstractMatrix, or NamedTuple should go to multiple target columns? So if my function is returning one of those values, shouldn't I be able to output to multiple columns via transform?

Thanks also for the ByRow option. I'll look into it further. The issue is that the function uses input from more than one row of a given column when calculating the output, so having the function output the entire column seems like it might be more efficient than iterating over the DataFrame with ByRow?
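
For reference, a minimal sketch of that whole-column case: a function that sees every row of each group at once and returns a NamedTuple of vectors, which AsTable then spreads over several columns. The function running_stats, its scale parameter, and the toy data are hypothetical and only serve as an illustration.

```julia
using DataFrames

# Toy data (hypothetical): two groups of different sizes.
df = DataFrame(g = [1, 1, 1, 2, 2], A = [1, 2, 3, 10, 20], B = [5, 7, 9, 1, 4])
gdf = groupby(df, :g)

# Hypothetical whole-column function: each output row depends on
# several input rows of the same group, so ByRow would not be enough.
function running_stats(a, b, scale)
    return (CumA = cumsum(a) .* scale, OffsetB = b .- first(b))
end

# Note the parentheses around the anonymous function.
res = transform(gdf, [:A, :B] => ((a, b) -> running_stats(a, b, 2)) => AsTable)
```

Here the function is called once per group with whole column slices, so nothing is iterated row by row.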

Let me write what I understand you want. But if you want something else please comment:

julia> gdf = groupby(DataFrame(id=1:3, A=11:13, B=101:103), :id)
GroupedDataFrame with 3 groups based on key: id
First Group (1 row): id = 1
 Row │ id     A      B
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11    101
⋮
Last Group (1 row): id = 3
 Row │ id     A      B
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     3     13    103

julia> fun(a, b, p1, p2) = (x=a*p1, y=b*p2, z=(a+b)*p1*p2)
fun (generic function with 1 method)

julia> p1, p2 = 10, 1000;  # not shown in the original post; values inferred from the outputs below

julia> transform(gdf, [:A, :B ] => ((a, b ) -> fun(a,b, p1,p2)) => AsTable)
3×6 DataFrame
 Row │ id     A      B      x      y       z
     │ Int64  Int64  Int64  Int64  Int64   Int64
─────┼─────────────────────────────────────────────
   1 │     1     11    101    110  101000  1120000
   2 │     2     12    102    120  102000  1140000
   3 │     3     13    103    130  103000  1160000

julia> transform(gdf, [:A, :B ] => ((a, b ) -> fun(a,b, p1,p2)) => [:NewA, :NewB, :NewAB])
3×6 DataFrame
 Row │ id     A      B      NewA   NewB    NewAB
     │ Int64  Int64  Int64  Int64  Int64   Int64
─────┼─────────────────────────────────────────────
   1 │     1     11    101    110  101000  1120000
   2 │     2     12    102    120  102000  1140000
   3 │     3     13    103    130  103000  1160000

And a second example with a matrix:

julia> fun2(a, b, p1, p2) = [a*p1 b*p2 (a+b)*p1*p2]
fun2 (generic function with 1 method)

julia> transform(gdf, [:A, :B ] => ((a, b ) -> fun2(a,b, p1,p2)) => [:NewA, :NewB, :NewAB])
3×6 DataFrame
 Row │ id     A      B      NewA   NewB    NewAB
     │ Int64  Int64  Int64  Int64  Int64   Int64
─────┼─────────────────────────────────────────────
   1 │     1     11    101    110  101000  1120000
   2 │     2     12    102    120  102000  1140000
   3 │     3     13    103    130  103000  1160000

julia> transform(gdf, [:A, :B ] => ((a, b ) -> fun2(a,b, p1,p2)) => AsTable)
3×6 DataFrame
 Row │ id     A      B      x1     x2      x3
     │ Int64  Int64  Int64  Int64  Int64   Int64
─────┼─────────────────────────────────────────────
   1 │     1     11    101    110  101000  1120000
   2 │     2     12    102    120  102000  1140000
   3 │     3     13    103    130  103000  1160000

Thanks so much for taking the time to make this detailed example! It was instrumental in illustrating my mistake. I thought the issue was that transform wouldn't work if an array was passed as a parameter into fun without some modification to the output type, but it all works fine now with the template you provided.

On the off chance it will help other novices:

I had incorrectly used

transform(gdf, [:A, :B ] => (a, b ) -> fun(a,b, p1,p2) => [:NewA, :NewB, :NewAB])

as opposed to

transform(gdf, [:A, :B ] => ((a, b ) -> fun(a,b, p1,p2)) => [:NewA, :NewB, :NewAB])

Missing the () around (a, b ) -> fun(a,b, p1,p2) put the output of fun into a single column of the GroupedDataFrame.


Yes - the ( and ) are needed because of operator precedence rules in Julia.
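
A small sketch of what the parser does in each case, using only Base Julia (the column names and the a .+ b body are placeholders). Without the parentheses, everything to the right of -> becomes the function body, so the target column name ends up inside the function, which then returns a single Pair value:

```julia
# Without parentheses: parsed as  [:A, :B] => ((a, b) -> ((a .+ b) => :C))
bad = [:A, :B] => (a, b) -> a .+ b => :C
bad_is_function = bad.second isa Function   # the target name was swallowed by the lambda
bad_result = bad.second([1, 2], [3, 4])     # the function returns a Pair, not a vector

# With parentheses: parsed as  [:A, :B] => (function => :C)
good = [:A, :B] => ((a, b) -> a .+ b) => :C
good_is_pair = good.second isa Pair         # function => target name, as the minilanguage expects
good_target = good.second.second            # :C
```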


Got it! I appreciate the clarification.

This is a confusion that very often occurs, even to those who know the rule.
I wonder if there isn't a way to change the precedence rules locally within the minilanguage (still using the same "->" symbol for the function and "=>" for the pairs).
Obviously, in that case it would be necessary to use parentheses when you want pairs as the output of the function. But being an explicit choice, it would be less prone to "distractions".


I do not think it is technically possible unfortunately.

The precedence of -> is surprisingly low sometimes. This is also an issue when combining it with |>.
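
A quick sketch of the |> case (the expressions are arbitrary illustrations): the pipe ends up inside the body of the anonymous function unless the function is parenthesized.

```julia
# The whole expression parses as one anonymous function whose body contains the pipe.
ex = Meta.parse("x -> x + 1 |> sqrt")
is_arrow = ex.head == Symbol("->")

f = x -> x + 1 |> sqrt        # same as x -> sqrt(x + 1)
y = f(3)                      # sqrt(4)

# With parentheses, the function object itself is piped into typeof.
t = (x -> x + 1) |> typeof
```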

In the global context of Julia, what are the advantages (and therefore the contraindications to any changes) of the current rules of precedence?

Since we have "->" with lower precedence than "=>", what do you think of a super function indicated with "+>" :), to use in case you want a different order from the standard?

Irrespective of the merits, this would be a pretty massively breaking change to the parser affecting loads of Julia code (whether in the context of DataFrames or not) and so won't happen for 1.0

Also, a key design goal for DataFrames is consistency with base Julia, so having context-dependent operator precedence within DataFrames runs counter to that. This is the domain of macros, where the @ clearly signals "this bit of code does not mean what you might think it means!"

ok.
What I can't evaluate is the reasons WHY "=>" has higher precedence than "->" and therefore, in addition to compatibility problems with already written code, what the contraindications to changing the order of precedence would be.
In the context of the minilanguage I would say that in the vast majority of cases (99%?) the opposite of the current one would be preferable.

I would also say that in the case of the minilanguage no solution (whether macro or otherwise) that requires explicit user intervention would be really useful, since once the user realizes the precedence problem, they can just enclose the function in parentheses "(" and ")".

In general this was discussed some time ago, but AFAICT:

  1. => has to have lower precedence than e.g. =
  2. = has to have lower precedence than ->

A complete example involving all three operators (artificial, but showing the issue) is:

julia> map(x -> x = 10 => 20, 1:2)
2-element Vector{Pair{Int64, Int64}}:
 10 => 20
 10 => 20

where you clearly do not want this to be parsed as:

julia> map(((x -> x) = 10) => 20, 1:2)
ERROR: syntax: invalid assignment location "x -> begin

So the fact that => has lower(*) precedence than -> comes from the transitivity of precedence ordering?

(*)
In my own language I would have used the term "higher precedence" in these cases, if I haven't misunderstood something.

I mixed up "lower" with "higher" in my comment - you are right.

Don't think transitivity is required here. I (and likely many others) definitely prefer x -> first(x) => length(x) to parse as a function, as it does now.

I fail to assess the practical relevance of such an example.
I would expect more of a situation like this

[first(x) => length(x) for x in vecofvec]

first and length are just arbitrary examples, no specific meaning attached here.

It's just that functional forms are more easily composable than comprehensions, so I'm basically talking about stuff like map(x -> x.a => x.b, X). Pretty useful when creating dictionaries, among other things.
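
A tiny sketch of that dictionary use case, with X being a made-up vector of NamedTuples:

```julia
# Hypothetical data: each element has fields a and b.
X = [(a = 1, b = "one"), (a = 2, b = "two")]

# The whole `x.a => x.b` is the function body, so map yields a vector of Pairs.
pairs_vec = map(x -> x.a => x.b, X)

# A vector of Pairs can be passed straight to Dict.
lookup = Dict(pairs_vec)
```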

And what are common use cases when you'd like the opposite precedence order? Aside from "minilanguages" that serve specific niches: it's not possible to cater to all possible DSLs in base Julia syntax anyway, and DSLs are free to use macros exactly for this purpose.