Broadcast transformed data from single row to multiple columns

daharn · December 5, 2022, 7:34pm

Suppose I have a DataFrame like this

df = DataFrame("a" => [1,2], "b" => [3,4])
2×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      3
   2 │     2      4

and I want to calculate several new columns using the function

function my_weird_fun(x,y)
    return x+y, x-y, x*y, x/y
end

I can do this

transform(data, [:a,:b]  => ByRow(my_weird_fun) => [:plus,:minus,:times,:div])
2×6 DataFrame
 Row │ a      b      plus   minus  times  div      
     │ Int64  Int64  Int64  Int64  Int64  Float64  
─────┼─────────────────────────────────────────────
   1 │     1      3      4     -2      3  0.333333
   2 │     2      4      6     -2      8  0.5

But what if I only want to use the first row of a and b and broadcast the result to all rows of the new columns, so that the result is

2×6 DataFrame
 Row │ a      b      plus   minus  times  div      
     │ Int64  Int64  Int64  Int64  Int64  Float64  
─────┼─────────────────────────────────────────────
   1 │     1      3      4     -2      3  0.333333
   2 │     2      4      4     -2      3  0.333333

Is there a way to achieve this using transform?

EDIT: This works but feels rather hacky:

transform(df, :a => ByRow(x->my_weird_fun(df.a[1],df.b[1])) => [:plus,:minus,:times,:div])

Is there a way I can achieve this without direct reference to df which might come out of a @chain?

bkamins · December 5, 2022, 8:00pm

An ugly solution is:

julia> transform(df, [:a,:b]  => ((x, y) -> fill(my_weird_fun(first(x), first(y)), length(x))) => [:plus,:minus,:times,:div])
2×6 DataFrame
 Row │ a      b      plus   minus  times  div
     │ Int64  Int64  Int64  Int64  Int64  Float64
─────┼─────────────────────────────────────────────
   1 │     1      3      4     -2      3  0.333333
   2 │     2      4      4     -2      3  0.333333

which is designed to re-use my_weird_fun. I assume this is what you want.

An alternative would be e.g.:

transform(df, [[:a,:b]]  .=> (Base.splat.([+, -, *, /]) .∘ (x...) -> getindex.(x, 1)) .=> [:plus, :minus, :times, :div])

The key challenge here is the second part i.e. forcing to use only the first element in computation.

daharn · December 5, 2022, 8:06pm

Thanks for the quick reply!
The first solution is probably going to be my workaround.

The background is that I am actually splitting a large Dataframe via groupby and want to apply the function once for each group (because it’s rather expensive) and then fill the results for all rows of each group. I could probably do this with loops as well, only thought there might be a solution within the transform syntax.

EDIT: I just realized that this rules out any use of ByRow, so my hacky solution is not very useful to begin with.

bkamins · December 5, 2022, 8:13pm

Then you should not use ByRow. But rather write a function in a way that it is aware that it will get vectors as input.

rocco_sprmnt21 · December 5, 2022, 8:30pm

transform(df, Returns((;zip([:plus,:minus,:times,:div],my_weird_fun(df[1,:a], df[1,:b]))...)))

bkamins · December 5, 2022, 8:32pm

Right - as I have commented above I would put something like this in a separate function

rocco_sprmnt21 · December 5, 2022, 8:47pm


df = DataFrame("gr"=> rand(1:10,20),"a" => rand(1:10,20), "b" => rand(1:10,20))
function my_weird_fun(x,y)
    return x+y, x-y, x*y, x/y
end

t(df)=transform(df, Returns((;zip([:plus,:minus,:times,:div],my_weird_fun(df[1,:a], df[1,:b]))...)))

combine(t,groupby(df,:gr))
julia> combine(t,groupby(df,:gr))
20×7 DataFrame
 Row │ gr     a      b      plus   minus  times  div     
     │ Int64  Int64  Int64  Int64  Int64  Int64  Float64
─────┼───────────────────────────────────────────────────
   1 │     3     10      6     16      4     60  1.66667
   2 │     3      2     10     16      4     60  1.66667
   3 │     3      7      6     16      4     60  1.66667
   4 │     4      4      2      6      2      8  2.0
   5 │     4      9      6      6      2      8  2.0
   6 │     4      2      7      6      2      8  2.0
   7 │     4      7      1      6      2      8  2.0
   8 │     4      6      8      6      2      8  2.0
   9 │     5      3      1      4      2      3  3.0
  10 │     5      5     10      4      2      3  3.0
  11 │     5      3      3      4      2      3  3.0
  12 │     5      6      1      4      2      3  3.0
  13 │     6      8      5     13      3     40  1.6
  14 │     7      7      4     11      3     28  1.75
  15 │     7      5      6     11      3     28  1.75
  16 │     7      3      4     11      3     28  1.75
  17 │     7      7      3     11      3     28  1.75
  18 │     8      2      4      6     -2      8  0.5
  19 │     8      1      7      6     -2      8  0.5
  20 │    10      6      1      7      5      6  6.0

rocco_sprmnt21 · December 6, 2022, 3:29pm

bkamins:

An alternative would be e.g.:
transform(df, [[:a,:b]]  .=> (Base.splat.([+, -, *, /]) .∘ (x...) -> getindex.(x, 1)) .=> [:plus, :minus, :times, :div])
The key challenge here is the second part i.e. forcing to use only the first element in computation.

In an attempt to understand how this phantasmagorical expression worked, I have produced the following versions which may make it easier for someone how things are.

transform(df, [[:a,:b]]  .=> ( map(f->Base.splat(f)∘((x...)->getindex.(x, 1)),[+, -, *, /])) .=> [:plus, :minus, :times, :div])

transform(df, [[:a,:b]] .=> [(x...)->sum(x), (x...)->(-)(x...), (x...)->.*(x...), (x...)->.\(x...)].=>[:plus, :minus, :times, :div])



transform(df, [[:a,:b]] .=> map(f->((x...)->f(x...)),[+, -, .*, .\]) .=>["a+b","a-b","a*b","a\b"] )

transform(df, [[:a,:b]] .=> map(f->((x...)->f(first.(x)...)),[+, -, *, \]) .=>["a+b","a-b","a*b","a\b"] )


transform(df, [:a,:b]=>( (x...)->(;zip([:plus,:minus,:times,:div], map(f->f(x...),[+, -, .*,.\]))...))=>AsTable)

Dan · December 6, 2022, 5:21pm

Here is another option which doesn’t use transform, but looks simple enough.

First, make my_weird_fun return NamedTuple:

function my_weird_fun(x,y)
    return (plus=x+y, minus=x-y, times=x*y, div=x/y)
end

Then:

df = DataFrame("a" => [1,2], "b" => [3,4])

df |> (d -> foreach(
              p->d[!,p[1]].=p[2], 
              pairs(my_weird_fun(first.(d[!,f] for f in [:a,:b])...))
      ))

Givinig:

julia> df
2×6 DataFrame
 Row │ a      b      plus   minus  times  div      
     │ Int64  Int64  Int64  Int64  Int64  Float64  
─────┼─────────────────────────────────────────────
   1 │     1      3      4     -2      3  0.333333
   2 │     2      4      4     -2      3  0.333333

ADDENDUM:
If the goal is to treat groups of a DataFrame separately, this can be done with this method as follows:

df = DataFrame(:a => [1,2,3], :b => [4,5,6], :c => [7,7,8])

foreach(d -> foreach(p->d[!,p[1]].=p[2], 
  pairs(my_weird_fun(first.(d[!,f] for f in [:a,:b])...))
  ), groupby(df, :c))

Now,

julia> df
3×7 DataFrame
 Row │ a      b      c      plus    minus   times   div      
     │ Int64  Int64  Int64  Int64?  Int64?  Int64?  Float64? 
─────┼───────────────────────────────────────────────────────
   1 │     1      4      7       5      -3       4      0.25
   2 │     2      5      7       5      -3       4      0.25
   3 │     3      6      8       9      -3      18      0.5

POST-ADDENDUM:
Keen-eyed readers will note the added columns are pesky unions with Missing even though nothing is missing. This can be fixed by iterating on groups cleverly:

foreach(d -> foreach(p->d[!,p[1]].=p[2], 
    pairs(my_weird_fun(first.(d[!,f] for f in [:a,:b])...))
  ), Iterators.flatten([(df,),Iterators.drop(groupby(df, :c),1)]))

daharn · December 6, 2022, 11:12pm

Thanks for all the nice suggestions, even though some of them appear quite arcane.

I picked a bit of each to come up with something that does the job while still looking relatively clean:

using DataFrames

df = DataFrame("gr"=> rand(1:10,20),"a" => rand(1:10,20), "b" => rand(1:10,20))

my_weird_fun(x,y) = (plus=x+y, minus=x-y, times=x*y, div=x/y)

for g in groupby(df,:gr)
    transform!(g, Returns(my_weird_fun(g.a[1],g.b[1])))
end

Except for the “pesky unions” already mentioned by @Dan, any more disadvantages here?

Dan · December 6, 2022, 11:53pm

Just replace groupby(df,:gr) with Iterators.flatten([(df,),groupby(df, :gr)]). It should get rid of the pesky unions (with some overhead, of course)

rocco_sprmnt21 · December 7, 2022, 7:46am

Using the following expressions stays within the mini-language and doesn’t present the problem of “pesky” columns.
about being more or less clear, it can be debated

julia> using DataFrames

julia> df = DataFrame("gr"=> rand(1:10,20),"a" => rand(1:10,20), "b" => rand(1:10,20))
20×3 DataFrame
 

julia> function mwf(x,y)
           return (plus=x+y, minus=x-y, times=x*y, div=x/y)
       end
mwf (generic function with 1 method)

julia> t(gdf)=transform(gdf, x->mwf(gdf[1,:a],gdf[1,:b]))   
t (generic function with 1 method)

julia> combine(t,groupby(df,:gr))
20×7 DataFrame
 Row │ gr     a      b      plus   minus  times  div      
     │ Int64  Int64  Int64  Int64  Int64  Int64  Float64  
─────┼────────────────────────────────────────────────────
   1 │     1      4      3      7      1     12   1.33333
   2 │     1      3      1      7      1     12   1.33333
   3 │     2      8     10     18     -2     80   0.8
   4 │     2      5      5     18     -2     80   0.8       
   5 │     3     10      1     11      9     10  10.0       
   6 │     4      9      7     16      2     63   1.28571   
   7 │     4     10      1     16      2     63   1.28571   
   8 │     4      7      8     16      2     63   1.28571   
   9 │     5      7      7     14      0     49   1.0       
  10 │     5      8      2     14      0     49   1.0       
  11 │     5      9      1     14      0     49   1.0       
  12 │     6      6      3      9      3     18   2.0       
  13 │     6      2      7      9      3     18   2.0       
  14 │     7      7      5     12      2     35   1.4       
  15 │     8      9      2     11      7     18   4.5       
  16 │     8      5      2     11      7     18   4.5       
  17 │     8      2      7     11      7     18   4.5       
  18 │     8      8      3     11      7     18   4.5       
  19 │     9      1      2      3     -1      2   0.5       
  20 │     9     10      7      3     -1      2   0.5

daharn · December 7, 2022, 12:18pm

Thanks. Toying around with this, I found that the source of the Union types is actually the fact that my version with the loop works in-place. Replacing transform with transform! in your example leads to the same result.

Another thing to consider is that my version, being an in-place operation, keeps the exact order of df while using combine on the GroupedDataFrame will order the resulting (new) DataFrame according to the groups.

EDIT:

Alright, my final version now looks like this:

using DataFrames
using DataFramesMeta

df = DataFrame("gr"=> rand(1:10,20),"a" => rand(1:10,20), "b" => rand(1:10,20))

function my_weird_fun(df) 
    x,y = @with df :a[1],:b[1]
    (plus=x+y, minus=x-y, times=x*y, div=x/y)
end

transform!(groupby(df,:gr),my_weird_fun)

@with in this case is obviously not needed, but in case you have a lot of input columns its clearer this way.

rocco_sprmnt21 · December 7, 2022, 1:27pm

or in the following form without having to rely on packages other than DataFrames.

function mwf(gdf,cols)
    fr=gdf[1,cols]
    return (plus=+(fr...), times=.*(fr...))
end

transform(g->mwf(g,[:a,:b]),groupby(df,:gr))

Topic		Replies	Views
Elegant ways to broadcast the same function to each column replacing the original column in DataFrames.jl New to Julia dataframes	9	1061	May 22, 2021
Transform multiple columns of a DataFrame using the same function Data dataframes	12	4161	January 23, 2023
Transform DataFrame into a single row by adding row contents as new columns New to Julia dataframes	3	314	May 1, 2021
Dataframe transform operation on multiple columns General Usage dataframes	10	4192	August 8, 2020
Dataframes transform! General Usage dataframes	8	1372	July 11, 2022

Broadcast transformed data from single row to multiple columns

Related topics