Where is the input column name information in a dataframe transformation?

rocco_sprmnt21 · December 21, 2022, 7:55pm

Suppose I want to apply some operation to specific columns of a dataframe and output a different number of columns.
When dynamically defining the names of the output columns, I want to use the names of the source columns.

In the following case I can do it, because the function after fun=> receives the input names.

transform(df, Cols([2,3,1,5])=>fun=>x->"new_".*x[4:-1:2])

But in the following case, if one wants to use the names inside the funcols() function to produce a named tuple for AsTable, how is it done?

transform(df, Cols([2,3,1,5])=>funcols=>AsTable)

pdeffebach · December 21, 2022, 9:13pm

You have to write funcols to return a NamedTuple and work with the names inside funcols. It all happens inside funcols and not in the destination.

rocco_sprmnt21 · December 21, 2022, 9:30pm

Inside funcols I have the column data but not the names. I wish I had the names available there to use in creating the new names.
I know the same result can be achieved in other ways, but I wanted to know whether this is somehow possible as well.

pdeffebach · December 21, 2022, 10:55pm

No, I don’t think so. The closest alternative is to pass the columns as a named tuple:

AsTable(Cols([2,3,1,5])) => funcols => AsTable

rocco_sprmnt21 · December 22, 2022, 10:10am

I was aware of this possibility. What I don’t understand is how (and why) it is possible, within the same transform, to refer to the names of the input columns after the second “=>” but not after the first “=>”.
I am asking only to better understand some internal mechanisms of DataFrames, not because I have a particular need for this “feature”.

sijo · December 22, 2022, 12:01pm

The reason is that with cols => f => g, DataFrames will call f with the column values, and g with the column names. It’s just how the API works. If you want to receive the names in the f function you can write AsTable(cols) => f. In this case DataFrames will pass to f the columns as a named tuple (the keys are the column names, the values are the column values). Note that in this case all the columns are passed as a single argument:

using DataFrames

df = DataFrame(a=1:3, b=4:6)

function f(table)
    # Add the (first) two columns of table
    result = table[1] + table[2]
    
    # Make the name "a+b" from column names "a" and "b"
    names = keys(table)
    result_name = Symbol(names[1], "+", names[2])
    return (; result_name => result)
end

transform(df, AsTable([:a, :b]) => f => AsTable)

# Output:

3×3 DataFrame
 Row │ a      b      a+b   
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      4      5
   2 │     2      5      7
   3 │     3      6      9
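
For comparison, here is a minimal sketch of the plain cols => f => g form described above, where f sees only the values and g sees only the names. The function names f and g and the "_plus_" separator are illustrative choices, not anything required by DataFrames:

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)

# f receives the column values as separate positional arguments
f(a, b) = a .+ b

# g receives the source column names and returns the destination name
g(colnames) = join(colnames, "_plus_")

out = transform(df, [:a, :b] => f => g)
```

Here f never learns that its arguments came from columns a and b; only g does, which is why the new column ends up named from the source names.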

rocco_sprmnt21 · December 22, 2022, 6:53pm

As mentioned in previous posts, I know of other ways to get the result, including passing namedtuples via AsTable.

In this case I would do it like this:

transform(df, [:a, :b] => (+) => x->join(x,"+"))

What I was trying to find out is whether it is somehow possible, even when cols is an Array{Symbol}, to use the column names inside the function and not just in the output naming context.

Since after the second “=>” I can define a function x->secondfun(colsnames) that uses the column names, I was trying to work out if, and how, I could use the same information (the names) in the context of the first function.
I know that the API currently works the way it does, but I just wanted to poke around behind the scenes without going through the code, which is very large and complex.

pdeffebach · December 22, 2022, 6:58pm

It seems like we are going in circles. The answer is simply no, the API is not constructed in that way. AsTable only knows about the named tuple passed to it. It doesn’t know about src in any way in the src => fun => dest expression. There is a way to do what you want, and we have described it above.

pdeffebach · December 22, 2022, 7:27pm

You could do a wrapper function around fun.

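# fix_names(nt, src): fix the column names of nt based on src and return a new named tuple
fix_names(nt, src) = ...
# when src is a vector of columns, fun receives them as separate arguments
src => ((t...) -> fix_names(fun(t...), src)) => AsTable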
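
One possible way to fill in the wrapper idea is sketched below. Both fun and the body of fix_names are hypothetical stand-ins (the post leaves them open); the renaming rule of prefixing each field with the joined source names is just an assumption for illustration:

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)
src = [:a, :b]

# Hypothetical inner function: it sees only the column values and
# returns a named tuple with placeholder field names
fun(x, y) = (total = x .+ y, delta = x .- y)

# Hypothetical renaming helper: prefix every field of nt with the
# source names joined by "_" and return a new named tuple
function fix_names(nt::NamedTuple, src)
    prefix = join(src, "_")
    newkeys = Tuple(Symbol(prefix, "_", k) for k in keys(nt))
    return NamedTuple{newkeys}(values(nt))
end

# The source columns are passed as separate arguments, hence cols...
out = transform(df, src => ((cols...) -> fix_names(fun(cols...), src)) => AsTable)
```

The names still come from the src variable captured by the anonymous function, not from anything DataFrames passes in.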

rocco_sprmnt21 · December 22, 2022, 9:01pm

It looks like Columbus’s egg, but the answer to my question could be something like the code below, except that you have to use an external variable.

PS
We weren’t going in circles but following a spiral, so we might eventually get to (or get close to) some point.

src = [:a, :b]
# Build a one-column named tuple named after the source columns joined by "+"
aplusb(src, x) = (; Symbol(join(src, "+")) => +(x...))
transform(df, src => ((x...) -> aplusb(src, x)) => AsTable)
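
If the external variable is the main annoyance, one option is a small helper that builds the whole src => fun => AsTable pair, so the column list is written only once. This is only a sketch; with_names is a hypothetical helper and not part of DataFrames:

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)

# Same idea as above: name the result after the source columns joined by "+"
aplusb(src, cols) = (; Symbol(join(src, "+")) => +(cols...))

# Hypothetical helper: captures src in a closure and returns the full
# src => function => AsTable specification
with_names(f, src) = src => ((cols...) -> f(src, cols)) => AsTable

out = transform(df, with_names(aplusb, [:a, :b]))
```

The names are still supplied by the caller rather than by DataFrames, so this only hides the external variable inside the closure.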