Prefixing output columns which are returned as Dataframe

mreichMPI-BGC · April 29, 2023, 6:56pm

Sorry, I am a bit stuck: suppose I have a function that takes Vectors and returns a DataFrame with as many rows as the Vectors have elements (e.g. my calc function below).

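using DataFrames

function calc(x1, x2)
    # running sum of the inputs plus one shared random offset
    y1 = accumulate(+, x1 + x2 .+ 0.1rand())
    # running product of y1
    y2 = accumulate(*, y1)
    y3 = accumulate(+, y2 ./ x1)
    DataFrame(; y1, y2, y3)
end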

Then I can use => AsTable to get the names of the calculated DataFrame as new column names. How can I programmatically add a suffix to distinguish repeated calculations? I came up with

d = DataFrame(x1=rand(100), x2=rand(100))

#transform(d, [:x1, :x2] => ((x,y) -> calc(x,y)) => AsTable) # Standard way
#transform(d, [:x1, :x2] => ((x,y) -> calc(x,y)) => ["y1","y2","y3"] .* "_mycalc") # manually
transform(d, [:x1, :x2] => ((x,y) -> calc(x,y)) => names(calc(rand(2), rand(2))) .* "_mycalc") # The result I want

This is rather a workaround (dummy-calling the function just to get the names)… Is there a better way to access the names?
I can’t manage to do this at all with DataFrameMacros, which I like very much.

Thanks for any help!

bkamins · April 29, 2023, 8:43pm

Currently the output column names are computed dynamically only if you use AsTable as the target. In other cases they are computed statically (i.e. before the transformation function is called). So currently the only way to do it would be to define calc differently, e.g. like this:

julia> function calc2(suffix)
           return function(x1, x2)
               y1 = accumulate(+, x1+x2 .+ 0.1rand())
               y2 = accumulate(*, y1)
               y3 = accumulate(+, y2 ./ x1)
               return DataFrame(Any[y1, y2, y3], string.("y", 1:3, suffix))
           end
       end
calc2 (generic function with 1 method)

julia> transform(d, [:x1, :x2] => calc2("_mycalc") => AsTable)
100×5 DataFrame
 Row │ x1          x2          y1_mycalc  y2_mycalc        y3_mycalc
     │ Float64     Float64     Float64    Float64          Float64
─────┼─────────────────────────────────────────────────────────────────────
   1 │ 0.0344154   0.435705      0.54877      0.54877         15.9455
   2 │ 0.35864     0.116174      1.10223      0.604873        17.6321
   3 │ 0.945627    0.16657       2.29308      1.38702         19.0988
   4 │ 0.489643    0.145826      3.0072       4.17106         27.6174
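Since the original question was about distinguishing repeated calculations, here is a minimal sketch of how the closure approach above could be used twice in one transform call. The suffixes "_run1" and "_run2" and the 10-row input are only illustrative assumptions; calc2 is repeated from the post above so the snippet runs on its own.

```julia
using DataFrames

# Closure factory as in the post above: the suffix is captured by the returned function.
function calc2(suffix)
    return function (x1, x2)
        y1 = accumulate(+, x1 + x2 .+ 0.1rand())
        y2 = accumulate(*, y1)
        y3 = accumulate(+, y2 ./ x1)
        return DataFrame(Any[y1, y2, y3], string.("y", 1:3, suffix))
    end
end

d = DataFrame(x1 = rand(10), x2 = rand(10))

# Two runs of the same calculation; the suffixes (illustrative names)
# keep the output columns from clashing.
res = transform(d,
    [:x1, :x2] => calc2("_run1") => AsTable,
    [:x1, :x2] => calc2("_run2") => AsTable)

names(res)
```

Because AsTable takes the column names from whatever each call returns, the two results can sit side by side as long as their names differ.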

mreichMPI-BGC · April 29, 2023, 8:59pm

Thanks for the explanation and hint!
I think I would then prefer something like the following, which does not require changing the original function:

# append the suffix s to every column name of df
addSuffix(df::DataFrame, s) = rename(df, names(df) .* s)
transform(d, [:x1, :x2] => ((x,y) -> addSuffix(calc(x,y), "_mycalc")) => AsTable) # plain function call
# or (requires `using Chain`)
transform(d, [:x1, :x2] => ((x,y) -> @chain calc(x,y) addSuffix("_mycalc")) => AsTable) # with @chain

I would still be happy to hear about a DataFrameMacros solution… @jules

rafael.guerra · April 30, 2023, 8:44am

Would it be an option for calc() to return a named tuple instead?
Then the code could be simplified:

function calc(x1,x2) 
    y1 = accumulate(+, x1+x2 .+ 0.1rand())
    y2 = accumulate(*, y1)
    y3 = accumulate(+, y2 ./ x1)
    return (y1_calc=y1, y2_calc=y2, y3_calc=y3)
end

d = DataFrame(x1=rand(100), x2=rand(100))

hcat(d, DataFrame(calc(d.x1,d.x2)))

mreichMPI-BGC · April 30, 2023, 8:49am

Thanks, but that would not allow adding the suffix programmatically. Otherwise, it is a good “design” question. I thought I wanted to stay within the well-defined DataFrame API framework, but I am happy to hear suggestions.
The use case is running dynamic models with several states forced by variables in a DataFrame.

rafael.guerra · April 30, 2023, 9:08am

Julia is the mother API, but with the mini-language perhaps you could then do instead:

transform(d, [:x1, :x2] => ((x1,x2) -> calc(x1, x2)) => AsTable)

mreichMPI-BGC · April 30, 2023, 9:11am

Which is what “we” do above… But your solution does not allow the programmatic addition of the suffix. That’s why @bkamins and I had solutions with suffix as function parameter.

jules · April 30, 2023, 9:25am

The base operation is

@transform(df, AsTable = @bycol calc(:x1, :x2))

but how AsTable works here cannot be modified due to the aforementioned limitations in the dispatch structure of the mini language. This is not something where DataFrameMacros could add convenience on top.

jules · April 30, 2023, 9:31am

Another option to do it with base DataFrames, and without adding a suffix parameter to calc:

julia> suffixer(s) = df -> rename(n -> n * s, df)
suffixer (generic function with 1 method)

julia> transform(df, [:x1, :x2] => suffixer("_suf") ∘ calc => AsTable)
100×5 DataFrame
 Row │ x1          x2         y1_suf     y2_suf           y3_suf
     │ Float64     Float64    Float64    Float64          Float64
─────┼────────────────────────────────────────────────────────────────────
   1 │ 0.708036    0.975863     1.76416      1.76416          2.49162
   2 │ 0.684251    0.0320959    2.56076      4.51759          9.09386
   3 │ 0.334323    0.49034      3.46568     15.6565          55.9244
   4 │ 0.0376339   0.699219     4.28279     67.0537        1837.66
   5 │ 0.85722     0.0254397    5.24571    351.744         2247.99
   6 │ 0.650973    0.426887     6.40383   2252.51          5708.21
   7 │ 0.432189    0.721066     7.63734  17203.2          45513.0
   8 │ 0.323039    0.221529     8.26217      1.42136e5        4.85509e5

This could be improved so that it works not just with DataFrames as intermediate results, but also with NamedTuples.
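One possible way to generalize the suffixer idea is to dispatch on the type of the intermediate result. The sketch below is only an illustration: addsuffix and calc_nt are hypothetical names introduced here, not part of DataFrames.jl or of the code posted in this thread.

```julia
using DataFrames

# Variant of calc that returns a NamedTuple of vectors (hypothetical name).
function calc_nt(x1, x2)
    y1 = accumulate(+, x1 + x2 .+ 0.1rand())
    y2 = accumulate(*, y1)
    y3 = accumulate(+, y2 ./ x1)
    return (; y1, y2, y3)
end

# Hypothetical helper with one method per supported result type.
addsuffix(df::AbstractDataFrame, s::AbstractString) = rename(n -> n * s, df)
addsuffix(nt::NamedTuple, s::AbstractString) =
    NamedTuple{Symbol.(keys(nt), s)}(values(nt))

# Same shape as the suffixer above, but delegating to addsuffix.
suffixer(s) = x -> addsuffix(x, s)

d = DataFrame(x1 = rand(10), x2 = rand(10))
res = transform(d, [:x1, :x2] => suffixer("_suf") ∘ calc_nt => AsTable)

names(res)
```

The same suffixer("_suf") can then be composed with a function that returns a DataFrame, since the matching addsuffix method is picked at call time.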

rafael.guerra · April 30, 2023, 9:36am

Based on a master’s solution here, I have adapted the code to handle it:

using DataFrames

function calc(x1,x2, suffix)
  y1 = accumulate(+, x1+x2 .+ 0.1rand())
  y2 = accumulate(*, y1)
  y3 = accumulate(+, y2 ./ x1)
  mynames = Symbol.((:y1, :y2, :y3), "_$suffix")
  return (;zip(mynames, (y1,y2,y3))...)
end

d = DataFrame(x1=rand(100), x2=rand(100))

hcat(d, DataFrame(calc(d.x1, d.x2, "mycalc")))
