Run multiple instances of transform on specific column combinations of a GroupedDataFrame in DataFrames mini language

If you’re interested in using a macro convenience package, https://github.com/jkrumbiegel/DataFrameMacros.jl makes such multi-column operations simpler to write.

Every symbol like :id or expression within { } is understood as one or many columns. The whole function expression is then broadcast over all of these collections of columns. In your case that could be one of the following options:

using DataFrameMacros
using DataFrames

df = DataFrame(id=1:3, A=11:13, B=101:103, C = 25:27, D = 32:34)
gdf = groupby(df, :id)

@transform!(gdf, :id + {[:A, :B, :C, :D]})
@transform!(gdf, :id + {Between(:A, :D)})
@transform!(gdf, :id + {Not(:id)})
@transform!(gdf, :id + {r"[ABCD]"})

The nice thing is that DataFrameMacros handles the conversion to the same vectors of strings that @sijo was constructing manually via vcat.( for you. You can use every format that you’d normally use as selector in names(df, selector).

For a little more convenience, you can quickly name your new columns with a shortcut syntax:

Compare

julia> @transform!(gdf, :id + {[:A, :B, :C, :D]})
3×9 DataFrame
 Row │ id     A      B      C      D      id_A_+  id_B_+  id_C_+  id_D_+ 
     │ Int64  Int64  Int64  Int64  Int64  Int64   Int64   Int64   Int64  
─────┼───────────────────────────────────────────────────────────────────
   1 │     1     11    101     25     32      12     102      26      33
   2 │     2     12    102     26     33      14     104      28      35
   3 │     3     13    103     27     34      16     106      30      37

with

julia> @transform!(gdf, "id_plus_{2}" = :id + {Between(:A, :D)})
3×9 DataFrame
 Row │ id     A      B      C      D      id_plus_A  id_plus_B  id_plus_C  id_plus_D 
     │ Int64  Int64  Int64  Int64  Int64  Int64      Int64      Int64      Int64     
─────┼───────────────────────────────────────────────────────────────────────────────
   1 │     1     11    101     25     32         12        102         26         33
   2 │     2     12    102     26     33         14        104         28         35
   3 │     3     13    103     27     34         16        106         30         37
3 Likes

another way could be to move the “complexity” from the mini-language into the function

gdf = groupby(DataFrame(id=1:3, A=11:13, B=101:103, C = 25:27, D = 32:34), :id)

funny( col1, cols...)=mapreduce(c->(col1 .+5).*(c.+5), hcat, cols)


transform(gdf, Cols(:)=>funny=>[:a, :b, :c, :d])

transform(gdf, Cols(:)=>funny=>AsTable)

transform(gdf, Cols(:)=>funny=>"new_".*names(gdf)[2:5])
1 Like

Hi thanks so much for this detailed example! It really helped me think through the problem. Sorry it has taken me so long to work through this.

Is there a way to use the array comprehension method to include the naming of columns?

I ask because in trying to get the array comprehension to work I am getting something like the following error:


ERROR: ArgumentError: a single value or vector result is required (got NamedTuple{(:R, :S, :V, :P, :T, :J), NTuple{6, Vector{Float64}}})

In this case fun actually returns a NamedTuple to be stored in multiple columns of gdfso the minilanguage for a single transformation would actually look like.


transform!(gdf, [:id,:A] => ((x,y, p1, p2 )->fun(x,y, p1, p2))=> [:Aid1, :Aid2, :Aid3, :Aid4, :Aid5, :Aid6])

(Sorry in my inexperience in adhering to the posting guidelines I over simplified the original problem.)

if I try something like


transform(gdf, [[:id, col] => fun => AsTable for col in [:A, :B, :C]`])

I get an error like


ERROR: ArgumentError: duplicate output column name: :R

which I assume is because multiple instances of transform are taking place as a single instance like


transform(gdf, [:id, :A] => fun => AsTable)

does work as intended. I know @jules’s post touches on this with DataFrameMacros but was would outputting to individual named columns with just the mini language and array comprehension be possible?

Thanks so much this is really cool! Richard Sutton once said there is no such thing as a stupid question, so… I actually have 2 for you.

  1. Are there downsides in terms of stability or performance is using a macro convenience package over the mini language itself? It seems a little bit like those 0 calories sodas where you feel like there has to be a catch somewhere.

  2. Sorry I’m still a little unclear on the syntax here …so if fun itself has several parameters like

fun(col1, col2, p1, p2)

and the transform of a single instances would look like


transform!(gdf, [:id,:A] => ((x,y, p1, p2 )-> fun(x,y, p1, p2))=> [:Aid1, :Aid2, :Aid3, :Aid4, :Aid5, :Aid6])

would you very much illustrating how it would look in DataFrameMacros?

Thanks so much! I’m still trying to wrap my head around this approach. Is this approach limited to more straight forward calculations? Can it still work if there were additional parameters in the function and the output was a NamedTuple ?
e.g.

function fun(Col1,Col2, par1, par2, par3)
    (E,F,G) = [Vector{Float64}(undef, size(Col1,1)) for _ in 1:3]      
     # some looped calculation with if/then statements 
     return ( E = E, F= F, G=G)
end

If I understand the meaning correctly, the answer is yes.
And this should be the form that handles the output as a namedtuple.
If you provide even a minimal case of your situation we can attempt a test.

transform(gdf, Cols(:)=>funny=>AsTable)

A silly example follows

julia> function funnt(Col1,Col2, p1, p2, p3)
           (E,F,G) = [Vector{Float64}(undef, size(Col1,1)) for _ in 1:3]   

            for i in eachindex(Col1)
               if p3>0
                   E[i]=p1*Col1[i]+p2*Col2[i]
                   F[i]=(p1+Col1[i])*(p2+Col2[i])
                   G[i]=p2*Col1[i]+p1*Col2[i]
               end
            end
            return ( E = E, F= F, G=G)
       end
funnt (generic function with 1 method)

julia> combine(df,[:A,:B] => ((x,y)->funnt(x,y,1,2,3))=>AsTable)
3×3 DataFrame
 Row │ E        F        G       
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │   213.0   1236.0    123.0
   2 │   216.0   1352.0    126.0
   3 │   219.0   1470.0    129.0

julia> transform(df,[:A,:B] => ((x,y)->funnt(x,y,1,2,3))=>AsTable)
3×8 DataFrame
 Row │ id     A      B      C      D      E        F        G              
     │ Int64  Int64  Int64  Int64  Int64  Float64  Float64  Float64        
─────┼──────────────────────────────────────────────────────────────       
   1 │     1     11    101     25     32    213.0   1236.0    123.0        
   2 │     2     12    102     26     33    216.0   1352.0    126.0        
   3 │     3     13    103     27     34    219.0   1470.0    129.0   

in this form if the columns to be treated were many

julia> transform(df,[:A,:B] => ((x...)->funnt(x...,1,2,3))=>AsTable)       
3×8 DataFrame
 Row │ id     A      B      C      D      E        F        G              
     │ Int64  Int64  Int64  Int64  Int64  Float64  Float64  Float64        
─────┼──────────────────────────────────────────────────────────────       
   1 │     1     11    101     25     32    213.0   1236.0    123.0        
   2 │     2     12    102     26     33    216.0   1352.0    126.0        
   3 │     3     13    103     27     34    219.0   1470.0    129.0  
1 Like

There are no downsides for the meat of the transformation, which is running the transform function itself. The macro just exploits redundancies that are common in the specification of the mini-language, for example you have to specify input columns, but you also have to specify input arguments for the inner function which are always the same number as the input columns.

I have an unregistered package, GitHub - jkrumbiegel/PrettyMacroExpand.jl which is a bit nicer to use than @macroexpand to get a readable version of the code it outputs (it uses animal names for temporary macro-defined variables). Here’s a commented version of that, which will show you that at the core it’s not different than what you’d do manually, just with some bells and whistles to handle more complex cases:

using PrettyMacroExpand
using DataFrameMacros

@prettyexpand @transform!(gdf, "id_plus_{2}" = :id + {Between(:A, :D)})

DataFrameMacros.transform!(
    gdf,
    begin
        # store vector of column specifiers found in expression, converted to strings via `names(df, ...)`
        goshawk = ["id", DataFrameMacros.stringargs(Between(:A, :D), gdf)]

        # this is for the case where there are `{{ }}` expressions found, doesn't matter in this case
        wildebeest = DataFrameMacros.Tuple(
            DataFrameMacros.map(goshawk) do ibex
                ibex isa DataFrameMacros.Ref && return DataFrameMacros.Val(
                    DataFrameMacros.length(ibex[]),
                )
                return DataFrameMacros.nothing
            end,
        )

        # a reference to the function used, if this is not a simple call there would be
        # an anonymous function defined here
        squirrel = (+)

        # this is again for the case where there's a {{ }}, doesn't matter here, it will just return (+)
        crane = DataFrameMacros.makefunc(squirrel, wildebeest)

        # in DataFrameMacros, by default we use ByRow because I like it better
        finch = DataFrameMacros.ByRow(crane)

        # this is the function that executes the name transformation in "id_plus_{2}"
        butterfly = (
            (pheasant,) ->
                "id_plus_$(DataFrameMacros.getindex_colnames(pheasant, 2))"
        )

        # now we compute the output column names using that function and the input column names
        dotterel =
            butterfly.(
                vcat.("id", DataFrameMacros.stringargs(Between(:A, :D), gdf))
            )

        # the final mini-language construct is built, as you can see the input column vector is broadcasted
        # into pairs with the transformation function and the output columns, that's why
        # :id + {Between(:A, :D)} works.
        # In the object constructed by this call there are no inefficiencies compared to what you'd do manually.
        vcat.("id", DataFrameMacros.stringargs(Between(:A, :D), gdf)) .=>
            (finch .=> dotterel)
    end;
)

Looking at this, it is a little inefficient to look up the column names multiple times, but that is negligible compared to the cost of running the computation in most cases. I could also spend some time to improve the macro so it only computes the input column names once.

Regarding your second question, I don’t get what you’re trying to do here:

transform!(gdf, [:id,:A] => ((x,y, p1, p2 )-> fun(x,y, p1, p2))=> [:Aid1, :Aid2, :Aid3, :Aid4, :Aid5, :Aid6])

This wouldn’t work as there are just two input columns but four input arguments to the inner function and six output columns. Could you clarify what you’re trying to do, maybe with a working example I can translate to DataFrameMacros.jl syntax for you?

2 Likes

You can add => name inside of the array comprehension:

julia> transform(gdf, [[:id, col] => fun => "id_plus_$col" for col in [:A, :B, :C]])
3×7 DataFrame
 Row │ id     A      B      C      id_plus_A  id_plus_B  id_plus_C 
     │ Int64  Int64  Int64  Int64  Int64      Int64      Int64     
─────┼─────────────────────────────────────────────────────────────
   1 │     1     11    101     25         96        636        180
   2 │     2     12    102     26        119        749        217
   3 │     3     13    103     27        144        864        256

And here’s a variant using a function to generate the output column names based on the two input names:

make_name(cols) = "$(cols[1])_plus_$(cols[2])"  # or maybe nicer: string(cols[1], "_plus_", cols[2])

transform(gdf, [[:id, col] => fun => make_name for col in [:A, :B, :C]])

# Output
3×7 DataFrame
 Row │ id     A      B      C      id_plus_A  id_plus_B  id_plus_C 
     │ Int64  Int64  Int64  Int64  Int64      Int64      Int64     
─────┼─────────────────────────────────────────────────────────────
   1 │     1     11    101     25         96        636        180
   2 │     2     12    102     26        119        749        217
   3 │     3     13    103     27        144        864        256

As for having fun return the column names using a named tuple: that would be quite a different thing to do… In the above examples, fun is called separately on each column and it’s not clear why/how it should itself specify the output column name. Usually a named tuple is used when the processing function returns several columns (in this case it makes sense that the processing function would know how to name these columns). Anyway here’s one way to do in our case here:

# Same definition of fun as before
function fun( col1, col2)
    return @. (col1 +5) * (col2 +5) 
end

# Auxiliary function that gets the table from DataFrames
# and makes a table of fun results
function fun_on_table(t)
    # Make output names based on the column names
    names = Symbol.(keys(t)[1], "_plus_", keys(t)[2:end])
    # Call fun multiple times, passing first column and each of the others,
    # wrapping all outputs in a named tuple
    return NamedTuple(names .=> [fun(t[1], t[c]) for c in keys(t)[2:end]])
end

# Use like this:
transform(gdf, AsTable([:id, :A, :B, :C]) => fun_on_table => AsTable)

# Output:
3×7 DataFrame
 Row │ id     A      B      C      id_plus_A  id_plus_B  id_plus_C 
     │ Int64  Int64  Int64  Int64  Int64      Int64      Int64     
─────┼─────────────────────────────────────────────────────────────
   1 │     1     11    101     25         96        636        180
   2 │     2     12    102     26        119        749        217
   3 │     3     13    103     27        144        864        256

1 Like

Thanks so much for this really useful example. It’s exactly the issue I am having trouble with. (Sorry for the delay in response I just got done with exams and then got a bout of food poisoning. ) If you have time for a follow up, and I totally understand if you are busy due to the holidays. But the issue I am having is how do you then implement this function in your example

julia> function funnt(Col1,Col2, p1, p2, p3)
           (E,F,G) = [Vector{Float64}(undef, size(Col1,1)) for _ in 1:3]   

            for i in eachindex(Col1)
               if p3>0
                   E[i]=p1*Col1[i]+p2*Col2[i]
                   F[i]=(p1+Col1[i])*(p2+Col2[i])
                   G[i]=p2*Col1[i]+p1*Col2[i]
               end
            end
            return ( E = E, F= F, G=G)
       end
funnt (generic function with 1 method)

Into the mapreduce method you discussed when advising moving the complexity into the function. I don’t know if you would mind walking me through lines #2 and #3 of the code in your earlier example but if I understood it correctly.

# 1)
gdf = groupby(DataFrame(id=1:3, A=11:13, B=101:103, C = 25:27, D = 32:34), :id)

# 2) 
funny( col1, cols...)=mapreduce(c->(col1 .+5).*(c.+5), hcat, cols)

Here we are taking @. c->(col1+5)*(c+5) applying it to cols and then concatenating them horizontally. the ... operator means that there can be a variable amount of columns funny() applies to. So here

transform(gdf, Cols(:)=>funny=>[:a, :b, :c, :d])

funny() takes the :id as col1 and the remaining columns as the cols... argument. But how does this mapreduce method work with funnt() where we now have additional parameters and output vectors. Because in this format


transform(df,[:A,:B] => ((x,y)->funnt(x,y,1,2,3))=>AsTable)

I am not sure how to circumvent the issue of entering the columns individually that mapreduce() eliminated?

I really appreciate you sharing this! I’m still working through it but it does help me get a better sense of how macros work. (Sorry for the delay in response, getting over a bout of food poisoning. ) I guess the issue I am having and what I was trying to illustrate with

transform!(gdf, [:id,:A] => ((x,y, p1, p2 )-> fun(x,y, p1, p2))=> [:Aid1, :Aid2, :Aid3, :Aid4, :Aid5, :Aid6])

Is that the function fun itself takes as input two columns of the DataFrame but will output multiple columns to the DataFrame. So as per rocco_sprmnt21’s example where funnt outputs a NamedTuple where each vector of the NamedTuple is to be added to the DataFrame

function funnt(Col1,Col2, p1, p2, p3)
           (E,F,G) = [Vector{Float64}(undef, size(Col1,1)) for _ in 1:3]   

            for i in eachindex(Col1)
               if p3>0
                   E[i]=p1*Col1[i]+p2*Col2[i]
                   F[i]=(p1+Col1[i])*(p2+Col2[i])
                   G[i]=p2*Col1[i]+p1*Col2[i]
               end
            end
            return ( E = E, F= F, G=G)
       end
funnt (generic function with 1 method)

So I’m just not quite sure how to fit funnt into the DataFrameMacros format of something like

transform!(gdf, [:id,:A] => ((x,y)-> funnt(x,y,1,2,3 ))=> [:Aid1, :Aid2, :Aid3])
3×8 DataFrame
 Row │ id     A      B      C      D      Aid1     Aid2     Aid3    
     │ Int64  Int64  Int64  Int64  Int64  Float64  Float64  Float64 
─────┼──────────────────────────────────────────────────────────────
   1 │     1     11    101     25     32     23.0     26.0     13.0
   2 │     2     12    102     26     33     26.0     42.0     16.0
   3 │     3     13    103     27     34     29.0     60.0     19.0

into the simpler DataFrameMacros format you provided for multiple columns of the DataFrame.

@transform!(gdf, "id_plus_{2}" = :id + {Between(:A, :D)})

I’ll understand if you are busy but in any case thanks again for taking the time in explaining all this.

In DataFrameMacros.jl, the left-hand side is usually used as is for the sink in DataFrames.jl’s source .=> function .=> sink minilanguage. Exceptions are the "{}_some_suffix" string syntax and when {} appears. So in order to work with a NamedTuple output from some function, instead of => AsTable in DataFrames, you do AsTable = .... If your function works only for vector arguments, you should use @bycol (only @combine uses that by default). In the example below, both versions work because broadcasting with scalars returns scalars:

julia> df = DataFrame(x = [1, 2, 3], y = [4, 5, 6])
3×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> fun(x, y, i, j, k) = (; a = x .* i, b = y .* k, c = x .* y ./ i .* j .- k)
fun (generic function with 2 methods)

julia> df = DataFrame(x = [1, 2, 3], y = [4, 5, 6])
3×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> @select df AsTable = fun(:x, :y, 4, 5, 6)
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     4     24     -1.0
   2 │     8     30      6.5
   3 │    12     36     16.5

# in case your function only works on vectors
julia> @select df AsTable = @bycol fun(:x, :y, 4, 5, 6)
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     4     24     -1.0
   2 │     8     30      6.5
   3 │    12     36     16.5
1 Like

Thanks so much for explaining this. This is really great and actually helps with several other issues I’ve been trying to sort out (Sorry for the delay in response, finished exams and then got food poisoning.) Would really appreciate if you would have time to clear up some issues I’m still having with the array comprehension method you described, but I totally understand if you are busy.

Sorry I didn’t make this clear in the follow up. In this case fun is returning multiple columns that would be added to the DataFrame. So again borrowing from rocco_sprmnt21’s example
if

function funnt(Col1,Col2, p1, p2, p3)
           (E,F,G) = [Vector{Float64}(undef, size(Col1,1)) for _ in 1:3]   

            for i in eachindex(Col1)
               if p3>0
                   E[i]=p1*Col1[i]+p2*Col2[i]
                   F[i]=(p1+Col1[i])*(p2+Col2[i])
                   G[i]=p2*Col1[i]+p1*Col2[i]
               end
            end
            return ( E = E, F= F, G=G)
       end
funnt (generic function with 1 method)

then the array comprehension of

transform(gdf, [[:id, col] => ((x,y) -> funnt(x,y,1,2,3)) => AsTable  for col in [:A, :B, :C]])

returns an error. So even if the function itself handled the naming element it seems like array comprehension wouldn’t work in this format.

transform(gdf, [[:id, col] => fun for col in [:A, :B, :C]]) 

(Because otherwise it should have auto named the columns with the AsTable function?). Just wondering if there a work around for a function that outputs multiple columns in the array comprehension?

Thanks! Sorry I’m not that quick on the update here but how would I combine this with the method of taking input multiple columns in your previous post?

so previously there was the format

@transform!(gdf, "id_plus_{2}" = :id + {Between(:A, :D)})
3×9 DataFrame
 Row │ id     A      B      C      D      id_plus_A  id_plus_B  id_plus_C  id_plus_D 
     │ Int64  Int64  Int64  Int64  Int64  Int64      Int64      Int64      Int64     
─────┼───────────────────────────────────────────────────────────────────────────────
   1 │     1     11    101     25     32         12        102         26         33
   2 │     2     12    102     26     33         14        104         28         35
   3 │     3     13    103     27     34         16        106         30         37

so here if

df.z = [7,8,9]
df
3×4 DataFrame
 Row │ x      y      z          
     │ Int64  Int64  Int64  
─────┼────────────────────────────
   1 │     1      4      7    
   2 │     2      5      8    
   3 │     3      6      9    

How would I obtain the desired output of something like

# desired output not actual output from Julia  
df
3x9 DataFrame
Row  │ x      y      z      y.a     y.b     y.c     z.a    z.b    z.c          
     │ Int64  Int64  Int64  Int64   Int64   Float64 Int64  Int64  Float64  
─────┼──────────────────────────-----------------------------------------
   1 │     1      4      7     4     24    -1.0      4     42      2.75
   2 │     2      5      8     8     30     6.5      8     48      14.0
   3 │     3      6      9     12    36    16.5      12    54      27.75   

I thought maybe something like


@transform!(df, AsTable = fun(:x, {Between(:y, :z)},4,5,6)) 

but this returns an error syntax:{ } vector syntax is discontinued

The error I get is ArgumentError: duplicate output column name: :E, which is indeed the problem: with AsTable we say that column names are specified in the named tuples produced by funnt. But we produce the same names three times! (one for each of :A, :B and :C).

Summary: If you want to choose the column names in fun, then you must make sure that you produce unique column names.

One solution is to rename the columns after fun, but before you give them to transform:

using DataFrames

df = DataFrame(id=1:3, A=11:13, B=101:103, C=25:27);
gdf = groupby(df, :id)

function fun(Col1, Col2, p1, p2, p3)
   (E,F,G) = (Vector{Float64}(undef, size(Col1,1)) for _ in 1:3)  
    for i in eachindex(Col1)
       if p3>0
           E[i] = p1*Col1[i]+p2*Col2[i]
           F[i] = (p1+Col1[i])*(p2+Col2[i])
           G[i] = p2*Col1[i]+p1*Col2[i]
       end
    end
    return (; E, F, G)
end

suffix_fields(nt, suffix) = NamedTuple(Symbol(k, "_", suffix) => v for (k,v) in pairs(nt))

transform(gdf, [[:id, col] => ((x,y) -> suffix_fields(fun(x,y,1,2,3), col)) => AsTable for col in [:A, :B, :C]])

# Output:
3×13 DataFrame
 Row │ id     A      B      C      E_A      F_A      G_A      E_B      F_B      G_B      E_C      F_C      G_C     
     │ Int64  Int64  Int64  Int64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64 
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     1     11    101     25     23.0     26.0     13.0    203.0    206.0    103.0     51.0     54.0     27.0
   2 │     2     12    102     26     26.0     42.0     16.0    206.0    312.0    106.0     54.0     84.0     30.0
   3 │     3     13    103     27     29.0     60.0     19.0    209.0    420.0    109.0     57.0    116.0     33.0

Another way is to give fun itself another parameter it can use to make unique names:

function fun(Col1, Col2, p1, p2, p3, suffix)
   (E,F,G) = (Vector{Float64}(undef, size(Col1,1)) for _ in 1:3)  

    for i in eachindex(Col1)
       if p3>0
           E[i] = p1*Col1[i]+p2*Col2[i]
           F[i] = (p1+Col1[i])*(p2+Col2[i])
           G[i] = p2*Col1[i]+p1*Col2[i]
       end
    end
    return (;
        Symbol("E_$suffix") => E,
        Symbol("F_$suffix") => F,
        Symbol("G_$suffix") => G,
        )
end

transform(gdf, [[:id, col] => ((x,y) -> fun(x,y,1,2,3,col)) => AsTable for col in [:A, :B, :C]])

By the way, it’s easier to help you if you always show self-contained examples like the first one here, so that people can try your code with a simple copy-and-paste (rather than having to collect pieces from several messages in the thread).

1 Like

I think at that level of complexity you’re better served with the normal functional interface, I don’t think it’s practical to combine the implicit broadcast of columns with different names for the named tuples in the macro. It’s easier to do that in the way shown above.

1 Like

If I understood correctly, you would like to deal with the case of a generic number of input columns (as done with the funny function) but which outputs a namedtuple of vectors.
In addition, the function that transforms the columns must use additional parameters.
The following funnynt() should implement the scheme described.

The example is limited to the case of three specific output columns.
If you want a more general schema with different number of columns and with dynamically generated column names, specify better and we can try to make the generalization.


function funnynt(id,Cols...; p1=1,p2=2,p3=3)
    (E,F,G) = [Vector{Float64}(undef, size(Cols[1],1)) for _ in 1:3]      
     for i in eachindex(id)
        if p3>0
            E[i]=p1*id[i]+p2*Cols[1][i]
            F[i]=(p1+id[i])*(p2+Cols[2][i])
            G[i]=p2*Cols[1][i]+p1*Cols[2][i]+p3*Cols[3][i]
        else
            E[i]=p1*id[i]+p2*Cols[1][i]
            F[i]=(p1+id[i])*(p2+Cols[2][i])
            G[i]=p2*Cols[1][i]+p1*Cols[2][i]+p3*Cols[3][i]
        end
     end
     return ( E = E, F= F, G=G)
end
transform(gdf,Cols(:) => funnynt=>AsTable)  # p1, p2, p3 have default values

transform(gdf,Cols(:) =>((x...)->funnynt(x...,p1=0;p3=0))=>AsTable)




julia> df = DataFrame(id=rand(1:3,10), A=rand(11:13,10),       
       B=rand(101:103,10), C = rand(25:27,10), D = rand(32:34,10))
10×5 DataFrame
 Row │ id     A      B      C      D     
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │     1     13    101     27     32
   2 │     3     13    103     26     33
   3 │     1     11    101     26     33
   4 │     3     13    101     26     33
   5 │     3     11    101     26     33
   6 │     2     11    103     25     32
   7 │     2     12    103     27     33
   8 │     2     11    101     25     33
   9 │     1     12    101     27     33
  10 │     3     13    101     26     33
julia> transform(gdf,Cols(:) =>((x...)->funnynt(x...,p1=0;p3=0))=>AsTable)
10×8 DataFrame
 Row │ id     A      B      C      D      E        F        G           
     │ Int64  Int64  Int64  Int64  Int64  Float64  Float64  Float64     
─────┼──────────────────────────────────────────────────────────────    
   1 │     1     13    101     27     32     26.0    103.0     26.0     
   2 │     3     13    103     26     33     26.0    315.0     26.0     
   3 │     1     11    101     26     33     22.0    103.0     22.0     
   4 │     3     13    101     26     33     26.0    309.0     26.0     
   5 │     3     11    101     26     33     22.0    309.0     22.0     
   6 │     2     11    103     25     32     22.0    210.0     22.0     
   7 │     2     12    103     27     33     24.0    210.0     24.0     
   8 │     2     11    101     25     33     22.0    206.0     22.0     
   9 │     1     12    101     27     33     24.0    103.0     24.0     
  10 │     3     13    101     26     33     26.0    309.0     26.0     

julia> transform(gdf,Cols(:) => funnynt=>AsTable)
10×8 DataFrame
 Row │ id     A      B      C      D      E        F        G           
     │ Int64  Int64  Int64  Int64  Int64  Float64  Float64  Float64     
─────┼──────────────────────────────────────────────────────────────    
   1 │     1     13    101     27     32     27.0    206.0    208.0     
   2 │     3     13    103     26     33     29.0    420.0    207.0     
   3 │     1     11    101     26     33     23.0    206.0    201.0     
   4 │     3     13    101     26     33     29.0    412.0    205.0     
   5 │     3     11    101     26     33     25.0    412.0    201.0     
   6 │     2     11    103     25     32     24.0    315.0    200.0     
   7 │     2     12    103     27     33     26.0    315.0    208.0     
   8 │     2     11    101     25     33     24.0    309.0    198.0     
   9 │     1     12    101     27     33     25.0    206.0    206.0     
  10 │     3     13    101     26     33     29.0    412.0    205.0   

output generic number of columns and with dynamically generated column names

julia> function funnynt(id,Cols...; p1=1,p2=2,p3=3)
           m= Matrix{Float64}(undef, length(id),length(Cols))
           for i in eachindex(Cols)
               if p3>0
                   m[:,i]= (id .+p1).*(Cols[i].+p2*p3)
               else
                   m[:,i]= (id .+p1).*(Cols[i].+p2)
               end
           end
           #ncols=Symbol.("id_c",1:length(Cols))
           ncols=Symbol.("id_",names(df,Not(:id)))

           return (;zip(ncols,eachcol(m))...)
       end
funnynt (generic function with 1 method)

julia> transform(gdf,Cols(:) => funnynt=>AsTable)
10×9 DataFrame
 Row │ id     A      B      C      D      id_A     id_B     id_C     id_D    
     │ Int64  Int64  Int64  Int64  Int64  Float64  Float64  Float64  Float64 
─────┼───────────────────────────────────────────────────────────────────────
   1 │     2     12    101     27     34     54.0    321.0     99.0    120.0
   2 │     3     12    103     27     33     72.0    436.0    132.0    156.0
   3 │     2     12    103     27     32     54.0    327.0     99.0    114.0
   4 │     2     12    101     25     33     54.0    321.0     93.0    117.0
   5 │     1     11    102     27     32     34.0    216.0     66.0     76.0
   6 │     1     12    101     25     32     36.0    214.0     62.0     76.0
   7 │     2     13    103     26     33     57.0    327.0     96.0    117.0
   8 │     2     13    102     26     34     57.0    324.0     96.0    120.0
   9 │     2     12    101     27     33     54.0    321.0     99.0    117.0
  10 │     3     11    102     27     33     68.0    432.0    132.0    156.0
1 Like

Thanks so much I really appreciate you taking the time to walk me through this and showing me how to post better questions. Just one last thing - could you explain the use of the ; in the return statement of both examples of fun? e.g.

return (; E, F, G)

and in the second example.

return (;
        Symbol("E_$suffix") => E,
        Symbol("F_$suffix") => F,
        Symbol("G_$suffix") => G,
        )

I know the ; is important because if I remove it and try

return (
        Symbol("E_$suffix") => E,
        Symbol("F_$suffix") => F,
        Symbol("G_$suffix") => G,
        )

I get the same ArgumentError: duplicate output column name: as before. The help documentation says that in function declarations ; separates regular arguments from keywords, but I’m not sure how that is applicable here.

1 Like

Thank you so much! Sorry I would have liked to mark multiple solutions but wasn’t sure how to do it. If you have a moment could you clarify how the zip and ... commands are working here?

When I add the splat ... operator after the Cols argument in the function declaration does that turn that argument in the function into an array? So when I input Cols... into funnynt it would be in the form [col1,col2.col3]

Also I am not sure what I am getting wrong here. But when I try

df  = DataFrame(id=rand(1:3,10), A=rand(11:13,10),       
              B=rand(101:103,10), C = rand(25:27,10), D = rand(32:34,10))
10×5 DataFrame
 Row │ id     A      B      C      D     
     │ Int64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────
   1 │     1     12    102     26     32
   2 │     1     13    101     25     34
   3 │     2     13    102     26     33
   4 │     3     11    102     27     33
   5 │     1     13    102     26     33
   6 │     3     11    101     25     32
   7 │     1     11    103     26     34
   8 │     2     12    101     27     34
   9 │     2     12    102     25     32
  10 │     2     12    101     25     32
gdf = groupby(df,:id)
GroupedDataFrame with 3 groups based on key: id
First Group (4 rows): id = 1
 Row │ id     A      B      C      D     
     │ Int64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────
   1 │     1     12    102     26     32
   2 │     1     13    101     25     34
   3 │     1     13    102     26     33
   4 │     1     11    103     26     34
⋮
Last Group (2 rows): id = 3
 Row │ id     A      B      C      D     
     │ Int64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────
   1 │     3     11    102     27     33
   2 │     3     11    101     25     32
function funnynt(id,Cols...; p1=1,p2=2,p3=3)
                  m= Matrix{Float64}(undef, length(id),length(Cols))
                  for i in eachindex(Cols)
                      if p3>0
                          m[:,i]= (id .+p1).*(Cols[i].+p2*p3)
                      else
                          m[:,i]= (id .+p1).*(Cols[i].+p2)
                      end
                  end
                  ncols=Symbol.("id_",names(df,Not(:id)))
                  return (;zip(ncols,eachcol(m))...)
              end
funnynt (generic function with 1 method)

transform(gdf,Cols(:) => funnynt => AsTable)

I get

ERROR: NamedTuple names and field types must have matching lengths

Again I apologize if I am overlooking something elementary, but thanks so much for your time and help.

Here my log, starting from a clean session.
See if you can see any difference.

julia> using DataFrames

julia> df = DataFrame(id=rand(1:3,10), A=rand(11:13,10),
       B=rand(101:103,10), C = rand(25:27,10), D = rand(32:34,10))
10×5 DataFrame
 Row │ id     A      B      C      D     
     │ Int64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────
   1 │     1     13    103     26     32
   2 │     2     13    101     26     33
   3 │     3     13    103     25     32
   4 │     2     11    101     26     32
   5 │     2     12    102     25     34
   6 │     1     11    102     25     34
   7 │     2     11    103     26     32
   8 │     3     13    102     25     32
   9 │     1     12    103     25     33
  10 │     3     11    101     26     32

julia> gdf = groupby(df, :id)
GroupedDataFrame with 3 groups based on key: id
First Group (3 rows): id = 1
 Row │ id     A      B      C      D     
     │ Int64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────
   1 │     1     13    103     26     32
   2 │     1     11    102     25     34
   3 │     1     12    103     25     33
⋮
Last Group (3 rows): id = 3
 Row │ id     A      B      C      D     
     │ Int64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────
   1 │     3     13    103     25     32
   2 │     3     13    102     25     32
   3 │     3     11    101     26     32

julia> function funnynt(id,Cols...; p1=1,p2=2,p3=3)
           m= Matrix{Float64}(undef, length(id),length(Cols))   
           for i in eachindex(Cols)
               if p3>0
                   m[:,i]= (id .+p1).*(Cols[i].+p2*p3)
               else
                   m[:,i]= (id .+p1).*(Cols[i].+p2)
               end
           end
           #ncols=Symbol.("id_c",1:length(Cols))
           ncols=Symbol.("id_",names(df,Not(:id)))

           return (;zip(ncols,eachcol(m))...)
       end
funnynt (generic function with 1 method)

julia> transform(gdf,Cols(:) => funnynt=>AsTable)
10×9 DataFrame
 Row │ id     A      B      C      D      id_A     id_B     id_C     id_D ⋯
     │ Int64  Int64  Int64  Int64  Int64  Float64  Float64  Float64  Floa ⋯
─────┼─────────────────────────────────────────────────────────────────────
   1 │     1     13    103     26     32     38.0    218.0     64.0     7 ⋯
   2 │     2     13    101     26     33     57.0    321.0     96.0    11  
   3 │     3     13    103     25     32     76.0    436.0    124.0    15  
   4 │     2     11    101     26     32     51.0    321.0     96.0    11  
   5 │     2     12    102     25     34     54.0    324.0     93.0    12 ⋯
   6 │     1     11    102     25     34     34.0    216.0     62.0     8  
   7 │     2     11    103     26     32     51.0    327.0     96.0    11  
   8 │     3     13    102     25     32     76.0    432.0    124.0    15  
   9 │     1     12    103     25     33     36.0    218.0     62.0     7 ⋯
  10 │     3     11    101     26     32     68.0    428.0    128.0    15  
                                                           1 column omitted

julia> transform(gdf,Cols(:) =>((x...)->funnynt(x...,p1=0;p3=0))=>AsTable) 
10×9 DataFrame
 Row │ id     A      B      C      D      id_A     id_B     id_C     id_D ⋯
     │ Int64  Int64  Int64  Int64  Int64  Float64  Float64  Float64  Floa ⋯
─────┼─────────────────────────────────────────────────────────────────────
   1 │     1     13    103     26     32     15.0    105.0     28.0     3 ⋯
   2 │     2     13    101     26     33     30.0    206.0     56.0     7  
   3 │     3     13    103     25     32     45.0    315.0     81.0    10  
   4 │     2     11    101     26     32     26.0    206.0     56.0     6  
   5 │     2     12    102     25     34     28.0    208.0     54.0     7 ⋯
   6 │     1     11    102     25     34     13.0    104.0     27.0     3  
   7 │     2     11    103     26     32     26.0    210.0     56.0     6  
   8 │     3     13    102     25     32     45.0    312.0     81.0    10  
   9 │     1     12    103     25     33     14.0    105.0     27.0     3 ⋯
  10 │     3     11    101     26     32     39.0    309.0     84.0    10  
                                                           1 column omitted

As for

I don’t know if I’m able to clarify what actually happens behind the scene.
I go a bit for analogies in my experience with Julia (I don’t read much documentation, I must confess).


nt=(f1=1,f2=2,f3=3)   # basic definition of a named tuple

nt=(;(f1=1,f2=2,f3=3)...) # the function of the splatting operator is to "open" the collection


# If you need to define a namedtuple dynamically, you need to follow these steps




fields=Symbol.('f',1:3)   # a sequence of symbols

# and

values=[1,2,3] # a sequence of values

# below various ways to input it to the syntax (;kwargs...)
# which should be treated by the parser (?) in the same way as non-positional parameters of functions!?
# But here perhaps it is better for someone who knows about these things to be more specific.
(;zip(fields,values)...)
(;Pair.(fields,values)...)
(;tuple.(fields,values)...)
1 Like

To expand a bit on @rocco_sprmnt21’s answer:

It’s a syntax for declaring named tuples. It’s not very well documented so I’ve submitted a PR.

Basically, you can write (; :a => 1, :b => 2) to create the named tuple as if you had written (a=1, b=2). The first form is more flexible: you can generate the name programmatically (as done above with Symbol("E_$suffix")), and if you want a tuple field to have the same name and same value as a variable, you don’t have to repeat yourself: (; a=a) can be shortened to (; a). I used this above in (; E, F, G).

It’s easy to try yourself to see what Julia does, for example:

julia> f(x...) = x;

julia> f(1,2,"text")
(1, 2, "text")

so if a function is declared with x, in the body the value of x is a tuple with all the arguments passed to the function (note that for keyword arguments it’s different, you get a special object that also holds the argument names).

The description above is for the case where ... is used in the function declaration, as in your funnynt(id, Cols...; p1=1, p2=2, p3=3). In the case of (; zip(ncols,eachcol(m))...) it’s different: here the splatting operator is not in a declaration, it’s more like a call. See the FAQ.

2 Likes