Run multiple instances of transform on specific column combinations of a GroupedDataFrame in DataFrames mini language

I’m probably missing something very basic in the syntax but if I have a GroupedDataFrame with n columns. (I’ll set n = 4 to illustrate the issue but for the actual use case suppose n >> 4)

gdf = groupby(DataFrame(id=1:3, A=11:13, B=101:103, C = 25:27, D = 32:34), :id)

GroupedDataFrame with 3 groups based on key: id
First Group (1 row): id = 1
 Row │ id     A      B      C      D     
     │ Int64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────
   1 │     1     11    101     25     32
⋮
Last Group (1 row): id = 3
 Row │ id     A      B      C      D     
     │ Int64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────
   1 │     3     13    103     27     34

Suppose fun runs some calculation on any pair of given columns.

function fun( col1, col2)
    return @. (col1 +5) * (col2 +5) 
end 

What is the best way to run function fun on all combinations of column [:id] and columns [:A, :B, :C, :D.] via the DataFrame mini language. For a single set of two columns I could do:

transform!(gdf, [:id,:A] => ((x,y)->fun(x,y))=> :Aid)

But I’m not sure how to iterate this over the remaining pairs of columns [:id, :B], [:id, C] [: id, :D] because I’m not sure how to convert type String into a DataFrame column

Creating a set of column names as strings such as set = [ ":A", ":B", ":C", ":D"] and then looping over clearly won’t work as

for I = set
    transform!( gdf, [id,I] =>fun)
end

ERROR: ArgumentError: mixing `Symbol`s with other selectors is not allowed

So I tried

set = Iterators.product(["id"], names(gdf))

But I was unable to figure out how to remove (“id”, “id”) from a Base.Iterators type. (is this possible?) Instead I used

set = collect(Iterators.product(["id"], names(gdf)) 
set = filter(x -> x != ("id","id"), set)

however the following syntax results in a MethodError.

for x = set
    transform!(gdf, names(gdf, x)  =>  fun)
end

So I am curious what is the correct way to input the iterator into names or is there a better approach in general?

As a clumsier option I thought I could name the columns such that

gdf2 = groupby(DataFrame(id1234 = 1:3, A1 = 11:13, B2 = 101:103, C3 = 25:27, D4 = 32:34), :id1234)

then try something like

for I = 1:4
    transform!(gdf2, names(gdf2, r"I") => fun)
end

But this also returned a MethodError Any tips would be greatly appreciated, thanks!

If I understand correctly, you want a result like this (without the D column, for terseness):

groupby(DataFrame(id=1:3, A=11:13, B=101:103, C=25:27), :id);

transform(gdf, [:id, :A] => fun, [:id, :B] => fun, [:id, :C] => fun)

# Output:
3×7 DataFrame
 Row │ id     A      B      C      id_A_fun  id_B_fun  id_C_fun 
     │ Int64  Int64  Int64  Int64  Int64     Int64     Int64    
─────┼──────────────────────────────────────────────────────────
   1 │     1     11    101     25        96       636       180
   2 │     2     12    102     26       119       749       217
   3 │     3     13    103     27       144       864       256

So the question is, how to generate easily the part [:id, :A] => fun, [:id, :B] => fun, [:id, :C] => fun.

First, note that DataFrames.jl also accepts these pairs wrapped together in a single Vector argument:

julia> transform(gdf, [[:id, :A] => fun, [:id, :B] => fun, [:id, :C] => fun])
3×7 DataFrame
 Row │ id     A      B      C      id_A_fun  id_B_fun  id_C_fun 
     │ Int64  Int64  Int64  Int64  Int64     Int64     Int64    
─────┼──────────────────────────────────────────────────────────
   1 │     1     11    101     25        96       636       180
   2 │     2     12    102     26       119       749       217
   3 │     3     13    103     27       144       864       256

So how can we generate this vector [[:id, :A] => fun, [:id, :B] => fun, [:id, :C] => fun]? One way is to use broadcasting like this:

julia> [[:id, :A], [:id, :B], [:id, :C]] .=> fun
3-element Vector{Pair{Vector{Symbol}, typeof(fun)}}:
 [:id, :A] => fun
 [:id, :B] => fun
 [:id, :C] => fun

The next step is to generate [[:id, :A], [:id, :B], [:id, :C]] also with broadcasting:

julia> vcat.(:id, [:A, :B, :C, :D])
4-element Vector{Vector{Symbol}}:
 [:id, :A]
 [:id, :B]
 [:id, :C]
 [:id, :D]

Putting all these together gives

julia> transform(gdf, (vcat.(:id, [:A, :B, :C]) .=> fun))
3×7 DataFrame
 Row │ id     A      B      C      id_A_fun  id_B_fun  id_C_fun 
     │ Int64  Int64  Int64  Int64  Int64     Int64     Int64    
─────┼──────────────────────────────────────────────────────────
   1 │     1     11    101     25        96       636       180
   2 │     2     12    102     26       119       749       217
   3 │     3     13    103     27       144       864       256
1 Like

Actually a better, simpler way is with array comprehension:

julia> transform(gdf, [[:id, col] => fun for col in [:A, :B, :C]])
3×7 DataFrame
 Row │ id     A      B      C      id_A_fun  id_B_fun  id_C_fun 
     │ Int64  Int64  Int64  Int64  Int64     Int64     Int64    
─────┼──────────────────────────────────────────────────────────
   1 │     1     11    101     25        96       636       180
   2 │     2     12    102     26       119       749       217
   3 │     3     13    103     27       144       864       256
2 Likes

If you’re interested in using a macro convenience package, https://github.com/jkrumbiegel/DataFrameMacros.jl makes such multi-column operations simpler to write.

Every symbol like :id or expression within { } is understood as one or many columns. The whole function expression is then broadcast over all of these collections of columns. In your case that could be one of the following options:

using DataFrameMacros
using DataFrames

df = DataFrame(id=1:3, A=11:13, B=101:103, C = 25:27, D = 32:34)
gdf = groupby(df, :id)

@transform!(gdf, :id + {[:A, :B, :C, :D]})
@transform!(gdf, :id + {Between(:A, :D)})
@transform!(gdf, :id + {Not(:id)})
@transform!(gdf, :id + {r"[ABCD]"})

The nice thing is that DataFrameMacros handles the conversion to the same vectors of strings that @sijo was constructing manually via vcat.( for you. You can use every format that you’d normally use as selector in names(df, selector).

For a little more convenience, you can quickly name your new columns with a shortcut syntax:

Compare

julia> @transform!(gdf, :id + {[:A, :B, :C, :D]})
3×9 DataFrame
 Row │ id     A      B      C      D      id_A_+  id_B_+  id_C_+  id_D_+ 
     │ Int64  Int64  Int64  Int64  Int64  Int64   Int64   Int64   Int64  
─────┼───────────────────────────────────────────────────────────────────
   1 │     1     11    101     25     32      12     102      26      33
   2 │     2     12    102     26     33      14     104      28      35
   3 │     3     13    103     27     34      16     106      30      37

with

julia> @transform!(gdf, "id_plus_{2}" = :id + {Between(:A, :D)})
3×9 DataFrame
 Row │ id     A      B      C      D      id_plus_A  id_plus_B  id_plus_C  id_plus_D 
     │ Int64  Int64  Int64  Int64  Int64  Int64      Int64      Int64      Int64     
─────┼───────────────────────────────────────────────────────────────────────────────
   1 │     1     11    101     25     32         12        102         26         33
   2 │     2     12    102     26     33         14        104         28         35
   3 │     3     13    103     27     34         16        106         30         37
3 Likes

another way could be to move the “complexity” from the mini-language into the function

gdf = groupby(DataFrame(id=1:3, A=11:13, B=101:103, C = 25:27, D = 32:34), :id)

funny( col1, cols...)=mapreduce(c->(col1 .+5).*(c.+5), hcat, cols)


transform(gdf, Cols(:)=>funny=>[:a, :b, :c, :d])

transform(gdf, Cols(:)=>funny=>AsTable)

transform(gdf, Cols(:)=>funny=>"new_".*names(gdf)[2:5])
1 Like

Hi thanks so much for this detailed example! It really helped me think through the problem. Sorry it has taken me so long to work through this.

Is there a way to use the array comprehension method to include the naming of columns?

I ask because in trying to get the array comprehension to work I am getting something like the following error:


ERROR: ArgumentError: a single value or vector result is required (got NamedTuple{(:R, :S, :V, :P, :T, :J), NTuple{6, Vector{Float64}}})

In this case fun actually returns a NamedTuple to be stored in multiple columns of gdfso the minilanguage for a single transformation would actually look like.


transform!(gdf, [:id,:A] => ((x,y, p1, p2 )->fun(x,y, p1, p2))=> [:Aid1, :Aid2, :Aid3, :Aid4, :Aid5, :Aid6])

(Sorry in my inexperience in adhering to the posting guidelines I over simplified the original problem.)

if I try something like


transform(gdf, [[:id, col] => fun => AsTable for col in [:A, :B, :C]`])

I get an error like


ERROR: ArgumentError: duplicate output column name: :R

which I assume is because multiple instances of transform are taking place as a single instance like


transform(gdf, [:id, :A] => fun => AsTable)

does work as intended. I know @jules’s post touches on this with DataFrameMacros but was would outputting to individual named columns with just the mini language and array comprehension be possible?

Thanks so much this is really cool! Richard Sutton once said there is no such thing as a stupid question, so… I actually have 2 for you.

  1. Are there downsides in terms of stability or performance is using a macro convenience package over the mini language itself? It seems a little bit like those 0 calories sodas where you feel like there has to be a catch somewhere.

  2. Sorry I’m still a little unclear on the syntax here …so if fun itself has several parameters like

fun(col1, col2, p1, p2)

and the transform of a single instances would look like


transform!(gdf, [:id,:A] => ((x,y, p1, p2 )-> fun(x,y, p1, p2))=> [:Aid1, :Aid2, :Aid3, :Aid4, :Aid5, :Aid6])

would you very much illustrating how it would look in DataFrameMacros?

Thanks so much! I’m still trying to wrap my head around this approach. Is this approach limited to more straight forward calculations? Can it still work if there were additional parameters in the function and the output was a NamedTuple ?
e.g.

function fun(Col1,Col2, par1, par2, par3)
    (E,F,G) = [Vector{Float64}(undef, size(Col1,1)) for _ in 1:3]      
     # some looped calculation with if/then statements 
     return ( E = E, F= F, G=G)
end

If I understand the meaning correctly, the answer is yes.
And this should be the form that handles the output as a namedtuple.
If you provide even a minimal case of your situation we can attempt a test.

transform(gdf, Cols(:)=>funny=>AsTable)

A silly example follows

julia> function funnt(Col1,Col2, p1, p2, p3)
           (E,F,G) = [Vector{Float64}(undef, size(Col1,1)) for _ in 1:3]   

            for i in eachindex(Col1)
               if p3>0
                   E[i]=p1*Col1[i]+p2*Col2[i]
                   F[i]=(p1+Col1[i])*(p2+Col2[i])
                   G[i]=p2*Col1[i]+p1*Col2[i]
               end
            end
            return ( E = E, F= F, G=G)
       end
funnt (generic function with 1 method)

julia> combine(df,[:A,:B] => ((x,y)->funnt(x,y,1,2,3))=>AsTable)
3×3 DataFrame
 Row │ E        F        G       
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │   213.0   1236.0    123.0
   2 │   216.0   1352.0    126.0
   3 │   219.0   1470.0    129.0

julia> transform(df,[:A,:B] => ((x,y)->funnt(x,y,1,2,3))=>AsTable)
3×8 DataFrame
 Row │ id     A      B      C      D      E        F        G              
     │ Int64  Int64  Int64  Int64  Int64  Float64  Float64  Float64        
─────┼──────────────────────────────────────────────────────────────       
   1 │     1     11    101     25     32    213.0   1236.0    123.0        
   2 │     2     12    102     26     33    216.0   1352.0    126.0        
   3 │     3     13    103     27     34    219.0   1470.0    129.0   

in this form if the columns to be treated were many

julia> transform(df,[:A,:B] => ((x...)->funnt(x...,1,2,3))=>AsTable)       
3×8 DataFrame
 Row │ id     A      B      C      D      E        F        G              
     │ Int64  Int64  Int64  Int64  Int64  Float64  Float64  Float64        
─────┼──────────────────────────────────────────────────────────────       
   1 │     1     11    101     25     32    213.0   1236.0    123.0        
   2 │     2     12    102     26     33    216.0   1352.0    126.0        
   3 │     3     13    103     27     34    219.0   1470.0    129.0  
1 Like

There are no downsides for the meat of the transformation, which is running the transform function itself. The macro just exploits redundancies that are common in the specification of the mini-language, for example you have to specify input columns, but you also have to specify input arguments for the inner function which are always the same number as the input columns.

I have an unregistered package, GitHub - jkrumbiegel/PrettyMacroExpand.jl which is a bit nicer to use than @macroexpand to get a readable version of the code it outputs (it uses animal names for temporary macro-defined variables). Here’s a commented version of that, which will show you that at the core it’s not different than what you’d do manually, just with some bells and whistles to handle more complex cases:

using PrettyMacroExpand
using DataFrameMacros

@prettyexpand @transform!(gdf, "id_plus_{2}" = :id + {Between(:A, :D)})

DataFrameMacros.transform!(
    gdf,
    begin
        # store vector of column specifiers found in expression, converted to strings via `names(df, ...)`
        goshawk = ["id", DataFrameMacros.stringargs(Between(:A, :D), gdf)]

        # this is for the case where there are `{{ }}` expressions found, doesn't matter in this case
        wildebeest = DataFrameMacros.Tuple(
            DataFrameMacros.map(goshawk) do ibex
                ibex isa DataFrameMacros.Ref && return DataFrameMacros.Val(
                    DataFrameMacros.length(ibex[]),
                )
                return DataFrameMacros.nothing
            end,
        )

        # a reference to the function used, if this is not a simple call there would be
        # an anonymous function defined here
        squirrel = (+)

        # this is again for the case where there's a {{ }}, doesn't matter here, it will just return (+)
        crane = DataFrameMacros.makefunc(squirrel, wildebeest)

        # in DataFrameMacros, by default we use ByRow because I like it better
        finch = DataFrameMacros.ByRow(crane)

        # this is the function that executes the name transformation in "id_plus_{2}"
        butterfly = (
            (pheasant,) ->
                "id_plus_$(DataFrameMacros.getindex_colnames(pheasant, 2))"
        )

        # now we compute the output column names using that function and the input column names
        dotterel =
            butterfly.(
                vcat.("id", DataFrameMacros.stringargs(Between(:A, :D), gdf))
            )

        # the final mini-language construct is built, as you can see the input column vector is broadcasted
        # into pairs with the transformation function and the output columns, that's why
        # :id + {Between(:A, :D)} works.
        # In the object constructed by this call there are no inefficiencies compared to what you'd do manually.
        vcat.("id", DataFrameMacros.stringargs(Between(:A, :D), gdf)) .=>
            (finch .=> dotterel)
    end;
)

Looking at this, it is a little inefficient to look up the column names multiple times, but that is negligible compared to the cost of running the computation in most cases. I could also spend some time to improve the macro so it only computes the input column names once.

Regarding your second question, I don’t get what you’re trying to do here:

transform!(gdf, [:id,:A] => ((x,y, p1, p2 )-> fun(x,y, p1, p2))=> [:Aid1, :Aid2, :Aid3, :Aid4, :Aid5, :Aid6])

This wouldn’t work as there are just two input columns but four input arguments to the inner function and six output columns. Could you clarify what you’re trying to do, maybe with a working example I can translate to DataFrameMacros.jl syntax for you?

2 Likes

You can add => name inside of the array comprehension:

julia> transform(gdf, [[:id, col] => fun => "id_plus_$col" for col in [:A, :B, :C]])
3×7 DataFrame
 Row │ id     A      B      C      id_plus_A  id_plus_B  id_plus_C 
     │ Int64  Int64  Int64  Int64  Int64      Int64      Int64     
─────┼─────────────────────────────────────────────────────────────
   1 │     1     11    101     25         96        636        180
   2 │     2     12    102     26        119        749        217
   3 │     3     13    103     27        144        864        256

And here’s a variant using a function to generate the output column names based on the two input names:

make_name(cols) = "$(cols[1])_plus_$(cols[2])"  # or maybe nicer: string(cols[1], "_plus_", cols[2])

transform(gdf, [[:id, col] => fun => make_name for col in [:A, :B, :C]])

# Output
3×7 DataFrame
 Row │ id     A      B      C      id_plus_A  id_plus_B  id_plus_C 
     │ Int64  Int64  Int64  Int64  Int64      Int64      Int64     
─────┼─────────────────────────────────────────────────────────────
   1 │     1     11    101     25         96        636        180
   2 │     2     12    102     26        119        749        217
   3 │     3     13    103     27        144        864        256

As for having fun return the column names using a named tuple: that would be quite a different thing to do… In the above examples, fun is called separately on each column and it’s not clear why/how it should itself specify the output column name. Usually a named tuple is used when the processing function returns several columns (in this case it makes sense that the processing function would know how to name these columns). Anyway here’s one way to do in our case here:

# Same definition of fun as before
function fun( col1, col2)
    return @. (col1 +5) * (col2 +5) 
end

# Auxiliary function that gets the table from DataFrames
# and makes a table of fun results
function fun_on_table(t)
    # Make output names based on the column names
    names = Symbol.(keys(t)[1], "_plus_", keys(t)[2:end])
    # Call fun multiple times, passing first column and each of the others,
    # wrapping all outputs in a named tuple
    return NamedTuple(names .=> [fun(t[1], t[c]) for c in keys(t)[2:end]])
end

# Use like this:
transform(gdf, AsTable([:id, :A, :B, :C]) => fun_on_table => AsTable)

# Output:
3×7 DataFrame
 Row │ id     A      B      C      id_plus_A  id_plus_B  id_plus_C 
     │ Int64  Int64  Int64  Int64  Int64      Int64      Int64     
─────┼─────────────────────────────────────────────────────────────
   1 │     1     11    101     25         96        636        180
   2 │     2     12    102     26        119        749        217
   3 │     3     13    103     27        144        864        256

1 Like

Thanks so much for this really useful example. It’s exactly the issue I am having trouble with. (Sorry for the delay in response I just got done with exams and then got a bout of food poisoning. ) If you have time for a follow up, and I totally understand if you are busy due to the holidays. But the issue I am having is how do you then implement this function in your example

julia> function funnt(Col1,Col2, p1, p2, p3)
           (E,F,G) = [Vector{Float64}(undef, size(Col1,1)) for _ in 1:3]   

            for i in eachindex(Col1)
               if p3>0
                   E[i]=p1*Col1[i]+p2*Col2[i]
                   F[i]=(p1+Col1[i])*(p2+Col2[i])
                   G[i]=p2*Col1[i]+p1*Col2[i]
               end
            end
            return ( E = E, F= F, G=G)
       end
funnt (generic function with 1 method)

Into the mapreduce method you discussed when advising moving the complexity into the function. I don’t know if you would mind walking me through lines #2 and #3 of the code in your earlier example but if I understood it correctly.

# 1)
gdf = groupby(DataFrame(id=1:3, A=11:13, B=101:103, C = 25:27, D = 32:34), :id)

# 2) 
funny( col1, cols...)=mapreduce(c->(col1 .+5).*(c.+5), hcat, cols)

Here we are taking @. c->(col1+5)*(c+5) applying it to cols and then concatenating them horizontally. the ... operator means that there can be a variable amount of columns funny() applies to. So here

transform(gdf, Cols(:)=>funny=>[:a, :b, :c, :d])

funny() takes the :id as col1 and the remaining columns as the cols... argument. But how does this mapreduce method work with funnt() where we now have additional parameters and output vectors. Because in this format


transform(df,[:A,:B] => ((x,y)->funnt(x,y,1,2,3))=>AsTable)

I am not sure how to circumvent the issue of entering the columns individually that mapreduce() eliminated?

I really appreciate you sharing this! I’m still working through it but it does help me get a better sense of how macros work. (Sorry for the delay in response, getting over a bout of food poisoning. ) I guess the issue I am having and what I was trying to illustrate with

transform!(gdf, [:id,:A] => ((x,y, p1, p2 )-> fun(x,y, p1, p2))=> [:Aid1, :Aid2, :Aid3, :Aid4, :Aid5, :Aid6])

Is that the function fun itself takes as input two columns of the DataFrame but will output multiple columns to the DataFrame. So as per rocco_sprmnt21’s example where funnt outputs a NamedTuple where each vector of the NamedTuple is to be added to the DataFrame

function funnt(Col1,Col2, p1, p2, p3)
           (E,F,G) = [Vector{Float64}(undef, size(Col1,1)) for _ in 1:3]   

            for i in eachindex(Col1)
               if p3>0
                   E[i]=p1*Col1[i]+p2*Col2[i]
                   F[i]=(p1+Col1[i])*(p2+Col2[i])
                   G[i]=p2*Col1[i]+p1*Col2[i]
               end
            end
            return ( E = E, F= F, G=G)
       end
funnt (generic function with 1 method)

So I’m just not quite sure how to fit funnt into the DataFrameMacros format of something like

transform!(gdf, [:id,:A] => ((x,y)-> funnt(x,y,1,2,3 ))=> [:Aid1, :Aid2, :Aid3])
3×8 DataFrame
 Row │ id     A      B      C      D      Aid1     Aid2     Aid3    
     │ Int64  Int64  Int64  Int64  Int64  Float64  Float64  Float64 
─────┼──────────────────────────────────────────────────────────────
   1 │     1     11    101     25     32     23.0     26.0     13.0
   2 │     2     12    102     26     33     26.0     42.0     16.0
   3 │     3     13    103     27     34     29.0     60.0     19.0

into the simpler DataFrameMacros format you provided for multiple columns of the DataFrame.

@transform!(gdf, "id_plus_{2}" = :id + {Between(:A, :D)})

I’ll understand if you are busy but in any case thanks again for taking the time in explaining all this.

In DataFrameMacros.jl, the left-hand side is usually used as is for the sink in DataFrames.jl’s source .=> function .=> sink minilanguage. Exceptions are the "{}_some_suffix" string syntax and when {} appears. So in order to work with a NamedTuple output from some function, instead of => AsTable in DataFrames, you do AsTable = .... If your function works only for vector arguments, you should use @bycol (only @combine uses that by default). In the example below, both versions work because broadcasting with scalars returns scalars:

julia> df = DataFrame(x = [1, 2, 3], y = [4, 5, 6])
3×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> fun(x, y, i, j, k) = (; a = x .* i, b = y .* k, c = x .* y ./ i .* j .- k)
fun (generic function with 2 methods)

julia> df = DataFrame(x = [1, 2, 3], y = [4, 5, 6])
3×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> @select df AsTable = fun(:x, :y, 4, 5, 6)
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     4     24     -1.0
   2 │     8     30      6.5
   3 │    12     36     16.5

# in case your function only works on vectors
julia> @select df AsTable = @bycol fun(:x, :y, 4, 5, 6)
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     4     24     -1.0
   2 │     8     30      6.5
   3 │    12     36     16.5
1 Like

Thanks so much for explaining this. This is really great and actually helps with several other issues I’ve been trying to sort out (Sorry for the delay in response, finished exams and then got food poisoning.) Would really appreciate if you would have time to clear up some issues I’m still having with the array comprehension method you described, but I totally understand if you are busy.

Sorry I didn’t make this clear in the follow up. In this case fun is returning multiple columns that would be added to the DataFrame. So again borrowing from rocco_sprmnt21’s example
if

function funnt(Col1,Col2, p1, p2, p3)
           (E,F,G) = [Vector{Float64}(undef, size(Col1,1)) for _ in 1:3]   

            for i in eachindex(Col1)
               if p3>0
                   E[i]=p1*Col1[i]+p2*Col2[i]
                   F[i]=(p1+Col1[i])*(p2+Col2[i])
                   G[i]=p2*Col1[i]+p1*Col2[i]
               end
            end
            return ( E = E, F= F, G=G)
       end
funnt (generic function with 1 method)

then the array comprehension of

transform(gdf, [[:id, col] => ((x,y) -> funnt(x,y,1,2,3)) => AsTable  for col in [:A, :B, :C]])

returns an error. So even if the function itself handled the naming element it seems like array comprehension wouldn’t work in this format.

transform(gdf, [[:id, col] => fun for col in [:A, :B, :C]]) 

(Because otherwise it should have auto named the columns with the AsTable function?). Just wondering if there a work around for a function that outputs multiple columns in the array comprehension?

Thanks! Sorry I’m not that quick on the update here but how would I combine this with the method of taking input multiple columns in your previous post?

so previously there was the format

@transform!(gdf, "id_plus_{2}" = :id + {Between(:A, :D)})
3×9 DataFrame
 Row │ id     A      B      C      D      id_plus_A  id_plus_B  id_plus_C  id_plus_D 
     │ Int64  Int64  Int64  Int64  Int64  Int64      Int64      Int64      Int64     
─────┼───────────────────────────────────────────────────────────────────────────────
   1 │     1     11    101     25     32         12        102         26         33
   2 │     2     12    102     26     33         14        104         28         35
   3 │     3     13    103     27     34         16        106         30         37

so here if

df.z = [7,8,9]
df
3×4 DataFrame
 Row │ x      y      z          
     │ Int64  Int64  Int64  
─────┼────────────────────────────
   1 │     1      4      7    
   2 │     2      5      8    
   3 │     3      6      9    

How would I obtain the desired output of something like

# desired output not actual output from Julia  
df
3x9 DataFrame
Row  │ x      y      z      y.a     y.b     y.c     z.a    z.b    z.c          
     │ Int64  Int64  Int64  Int64   Int64   Float64 Int64  Int64  Float64  
─────┼──────────────────────────-----------------------------------------
   1 │     1      4      7     4     24    -1.0      4     42      2.75
   2 │     2      5      8     8     30     6.5      8     48      14.0
   3 │     3      6      9     12    36    16.5      12    54      27.75   

I thought maybe something like


@transform!(df, AsTable = fun(:x, {Between(:y, :z)},4,5,6)) 

but this returns an error syntax:{ } vector syntax is discontinued

The error I get is ArgumentError: duplicate output column name: :E, which is indeed the problem: with AsTable we say that column names are specified in the named tuples produced by funnt. But we produce the same names three times! (one for each of :A, :B and :C).

Summary: If you want to choose the column names in fun, then you must make sure that you produce unique column names.

One solution is to rename the columns after fun, but before you give them to transform:

using DataFrames

df = DataFrame(id=1:3, A=11:13, B=101:103, C=25:27);
gdf = groupby(df, :id)

function fun(Col1, Col2, p1, p2, p3)
   (E,F,G) = (Vector{Float64}(undef, size(Col1,1)) for _ in 1:3)  
    for i in eachindex(Col1)
       if p3>0
           E[i] = p1*Col1[i]+p2*Col2[i]
           F[i] = (p1+Col1[i])*(p2+Col2[i])
           G[i] = p2*Col1[i]+p1*Col2[i]
       end
    end
    return (; E, F, G)
end

suffix_fields(nt, suffix) = NamedTuple(Symbol(k, "_", suffix) => v for (k,v) in pairs(nt))

transform(gdf, [[:id, col] => ((x,y) -> suffix_fields(fun(x,y,1,2,3), col)) => AsTable for col in [:A, :B, :C]])

# Output:
3×13 DataFrame
 Row │ id     A      B      C      E_A      F_A      G_A      E_B      F_B      G_B      E_C      F_C      G_C     
     │ Int64  Int64  Int64  Int64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64 
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     1     11    101     25     23.0     26.0     13.0    203.0    206.0    103.0     51.0     54.0     27.0
   2 │     2     12    102     26     26.0     42.0     16.0    206.0    312.0    106.0     54.0     84.0     30.0
   3 │     3     13    103     27     29.0     60.0     19.0    209.0    420.0    109.0     57.0    116.0     33.0

Another way is to give fun itself another parameter it can use to make unique names:

function fun(Col1, Col2, p1, p2, p3, suffix)
   (E,F,G) = (Vector{Float64}(undef, size(Col1,1)) for _ in 1:3)  

    for i in eachindex(Col1)
       if p3>0
           E[i] = p1*Col1[i]+p2*Col2[i]
           F[i] = (p1+Col1[i])*(p2+Col2[i])
           G[i] = p2*Col1[i]+p1*Col2[i]
       end
    end
    return (;
        Symbol("E_$suffix") => E,
        Symbol("F_$suffix") => F,
        Symbol("G_$suffix") => G,
        )
end

transform(gdf, [[:id, col] => ((x,y) -> fun(x,y,1,2,3,col)) => AsTable for col in [:A, :B, :C]])

By the way, it’s easier to help you if you always show self-contained examples like the first one here, so that people can try your code with a simple copy-and-paste (rather than having to collect pieces from several messages in the thread).

1 Like

I think at that level of complexity you’re better served with the normal functional interface, I don’t think it’s practical to combine the implicit broadcast of columns with different names for the named tuples in the macro. It’s easier to do that in the way shown above.

1 Like

If I understood correctly, you would like to deal with the case of a generic number of input columns (as done with the funny function) but which outputs a namedtuple of vectors.
In addition, the function that transforms the columns must use additional parameters.
The following funnynt() should implement the scheme described.

The example is limited to the case of three specific output columns.
If you want a more general schema with different number of columns and with dynamically generated column names, specify better and we can try to make the generalization.


function funnynt(id,Cols...; p1=1,p2=2,p3=3)
    (E,F,G) = [Vector{Float64}(undef, size(Cols[1],1)) for _ in 1:3]      
     for i in eachindex(id)
        if p3>0
            E[i]=p1*id[i]+p2*Cols[1][i]
            F[i]=(p1+id[i])*(p2+Cols[2][i])
            G[i]=p2*Cols[1][i]+p1*Cols[2][i]+p3*Cols[3][i]
        else
            E[i]=p1*id[i]+p2*Cols[1][i]
            F[i]=(p1+id[i])*(p2+Cols[2][i])
            G[i]=p2*Cols[1][i]+p1*Cols[2][i]+p3*Cols[3][i]
        end
     end
     return ( E = E, F= F, G=G)
end
transform(gdf,Cols(:) => funnynt=>AsTable)  # p1, p2, p3 have default values

transform(gdf,Cols(:) =>((x...)->funnynt(x...,p1=0;p3=0))=>AsTable)




julia> df = DataFrame(id=rand(1:3,10), A=rand(11:13,10),       
       B=rand(101:103,10), C = rand(25:27,10), D = rand(32:34,10))
10×5 DataFrame
 Row │ id     A      B      C      D     
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │     1     13    101     27     32
   2 │     3     13    103     26     33
   3 │     1     11    101     26     33
   4 │     3     13    101     26     33
   5 │     3     11    101     26     33
   6 │     2     11    103     25     32
   7 │     2     12    103     27     33
   8 │     2     11    101     25     33
   9 │     1     12    101     27     33
  10 │     3     13    101     26     33
julia> transform(gdf,Cols(:) =>((x...)->funnynt(x...,p1=0;p3=0))=>AsTable)
10×8 DataFrame
 Row │ id     A      B      C      D      E        F        G           
     │ Int64  Int64  Int64  Int64  Int64  Float64  Float64  Float64     
─────┼──────────────────────────────────────────────────────────────    
   1 │     1     13    101     27     32     26.0    103.0     26.0     
   2 │     3     13    103     26     33     26.0    315.0     26.0     
   3 │     1     11    101     26     33     22.0    103.0     22.0     
   4 │     3     13    101     26     33     26.0    309.0     26.0     
   5 │     3     11    101     26     33     22.0    309.0     22.0     
   6 │     2     11    103     25     32     22.0    210.0     22.0     
   7 │     2     12    103     27     33     24.0    210.0     24.0     
   8 │     2     11    101     25     33     22.0    206.0     22.0     
   9 │     1     12    101     27     33     24.0    103.0     24.0     
  10 │     3     13    101     26     33     26.0    309.0     26.0     

julia> transform(gdf,Cols(:) => funnynt=>AsTable)
10×8 DataFrame
 Row │ id     A      B      C      D      E        F        G           
     │ Int64  Int64  Int64  Int64  Int64  Float64  Float64  Float64     
─────┼──────────────────────────────────────────────────────────────    
   1 │     1     13    101     27     32     27.0    206.0    208.0     
   2 │     3     13    103     26     33     29.0    420.0    207.0     
   3 │     1     11    101     26     33     23.0    206.0    201.0     
   4 │     3     13    101     26     33     29.0    412.0    205.0     
   5 │     3     11    101     26     33     25.0    412.0    201.0     
   6 │     2     11    103     25     32     24.0    315.0    200.0     
   7 │     2     12    103     27     33     26.0    315.0    208.0     
   8 │     2     11    101     25     33     24.0    309.0    198.0     
   9 │     1     12    101     27     33     25.0    206.0    206.0     
  10 │     3     13    101     26     33     29.0    412.0    205.0   

output generic number of columns and with dynamically generated column names

julia> function funnynt(id,Cols...; p1=1,p2=2,p3=3)
           m= Matrix{Float64}(undef, length(id),length(Cols))
           for i in eachindex(Cols)
               if p3>0
                   m[:,i]= (id .+p1).*(Cols[i].+p2*p3)
               else
                   m[:,i]= (id .+p1).*(Cols[i].+p2)
               end
           end
           #ncols=Symbol.("id_c",1:length(Cols))
           ncols=Symbol.("id_",names(df,Not(:id)))

           return (;zip(ncols,eachcol(m))...)
       end
funnynt (generic function with 1 method)

julia> transform(gdf,Cols(:) => funnynt=>AsTable)
10×9 DataFrame
 Row │ id     A      B      C      D      id_A     id_B     id_C     id_D    
     │ Int64  Int64  Int64  Int64  Int64  Float64  Float64  Float64  Float64 
─────┼───────────────────────────────────────────────────────────────────────
   1 │     2     12    101     27     34     54.0    321.0     99.0    120.0
   2 │     3     12    103     27     33     72.0    436.0    132.0    156.0
   3 │     2     12    103     27     32     54.0    327.0     99.0    114.0
   4 │     2     12    101     25     33     54.0    321.0     93.0    117.0
   5 │     1     11    102     27     32     34.0    216.0     66.0     76.0
   6 │     1     12    101     25     32     36.0    214.0     62.0     76.0
   7 │     2     13    103     26     33     57.0    327.0     96.0    117.0
   8 │     2     13    102     26     34     57.0    324.0     96.0    120.0
   9 │     2     12    101     27     33     54.0    321.0     99.0    117.0
  10 │     3     11    102     27     33     68.0    432.0    132.0    156.0
1 Like

Thanks so much I really appreciate you taking the time to walk me through this and showing me how to post better questions. Just one last thing - could you explain the use of the ; in the return statement of both examples of fun? e.g.

return (; E, F, G)

and in the second example.

return (;
        Symbol("E_$suffix") => E,
        Symbol("F_$suffix") => F,
        Symbol("G_$suffix") => G,
        )

I know the ; is important because if I remove it and try

return (
        Symbol("E_$suffix") => E,
        Symbol("F_$suffix") => F,
        Symbol("G_$suffix") => G,
        )

I get the same ArgumentError: duplicate output column name: as before. The help documentation says that in function declarations ; separates regular arguments from keywords, but I’m not sure how that is applicable here.

1 Like