Apply function By Row without re-stating column names

I’d like to apply a function to each row of a DataFrame and have it add columns based on the output names. I would use transform-ByRow but I’d like to avoid restating the column names on input and output. Is there anything more elegant than the below?

using DataFrames, Pipe   # Pipe.jl provides @pipe

function f(; a=0, b=0, _...)
    x = a + b
    y = a - b
    (; x, y)
end

D = DataFrame(a=1:5, b=11:15, C=1:5)

@pipe eachrow(D) .|> (; _..., f(; _...)...) |> DataFrame

transform(D, [:a, :b] => ByRow((x,y) -> f(a=x, b=y)) => AsTable)

or change your function signature to function f(a,b) and then use

transform(D, [:a, :b] => ByRow(f) => AsTable)

“AsTable” is really cool, thank you.
It automatically maps the output column names.

This example still restates the input columns, though. In practice I have a CSV file or DB table with 30 columns. My function uses 10 of them (by name) and adds another 10 to the output table.
I’d like to avoid naming the 10 columns: provided the function parameter names match the table field names, I’d like them to be matched automatically.

This is really cool. I don’t think it’s possible with DataFrameMacros.jl just yet

ByRow(x -> f(x...))


Just a guess, or something along those lines?

Every ByRow example I’ve seen requires the input columns to be specified

I am thinking really hard about how to avoid input names. But it’s hard because you can have multiple definitions of f, possibly with different argument names! So on the surface, I don’t think it’s generally possible, nor a good idea.

You can do AsTable(:) => ByRow(f) => AsTable

But this will be awkward (large compile times) with many many columns. 30 is doable, though.

Do you have a full example?
This doesn’t work for me.
The closest thing is

transform(D, AsTable(:) => ByRow( x-> f(x...)) => AsTable)

But this requires positional arguments: function f(a,b)
I’d like named arguments: function f(; a, b, _...)
Within the 30 column table, the function might use columns 1,3,7,10,11,19,20,24,29
I’d like these automatically matched on name.

Though I take xiaodai’s point. If you have multiple versions of a function with different arguments it would be confusing.

Is it not possible to make your function accept a DataFrame instead?


using DataFrames
df = DataFrame(a = 1:10, b = 11:20)

function f(df)
  (x = df.a .+ df.b, )
end

transform(df, AsTable(:) => f => AsTable)

The function might have 100 lines.
a + b is much clearer than df.a .+ df.b especially when there are 10 terms in the equation.

Within the function I’d like to think purely in terms of the variables (not the data-frame columns)
I’d like to be able to call the function directly as well as apply it to the DataFrame.

Ok. then for this reason, I think it will be hard to achieve.

Thanks xiaodai.
Personally, I don’t create multiple versions of the same function. But I know it’s a much more important feature than what I’m suggesting. In any case, I can achieve the result with:

@pipe eachrow(D) .|> (; _..., f(; _... )... ) |> DataFrame

It’s just not readable.

What if it just gave an error when multiple versions of f are defined?
To use the feature, the programmer would restrict themselves to a single version, which is not a bad trade-off.

You were really close there, just missing a semicolon.

julia> using DataFrames

julia> function foo(; a = 1, b = 1, kwargs...)
           (c = a + b, d = a * b)
       end

julia> df = DataFrame(a = [5], b = [6]);

julia> transform(df, AsTable(:) => ByRow(t -> foo(; t...)) => AsTable)
1×4 DataFrame
 Row │ a      b      c      d
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     5      6     11     30
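
To see why the trailing kwargs... matters once the table has columns the function does not use, here is a minimal sketch in plain Julia. The name strict and the sample row are hypothetical, made up for illustration; foo mirrors the definition above.

```julia
# `foo` mirrors the definition above; `strict` is a hypothetical variant without kwargs...
foo(; a = 1, b = 1, kwargs...) = (c = a + b, d = a * b)
strict(; a = 1, b = 1) = (c = a + b, d = a * b)

row = (a = 5, b = 6, C = 1)   # stands in for one row that has an unused column C

loose = foo(; row...)         # the extra keyword C is absorbed by kwargs...

# strict has no slurp, so the unknown keyword C raises a MethodError
failed = try
    strict(; row...)
    false
catch err
    err isa MethodError
end
```

So a function that should be matched by name against a wide table needs the slurp, otherwise every unused column becomes an unsupported keyword argument.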

Just to note: with the syntax AsTable(:) => fun => ..., fun does not get an AbstractDataFrame. Rather, it gets a NamedTuple of vectors. And in AsTable(:) => ByRow(fun) => ..., fun gets a NamedTuple for each row.

The reason is that we want src => fun => dest to be very performant and type stable, which it can’t be if we just pass a DataFrame to fun.
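
A small sketch of that difference, assuming a toy two-column table (the names whole, per_row and a_is_vector are made up for illustration):

```julia
using DataFrames

df = DataFrame(a = [5, 6], b = [7, 8])

# Without ByRow the function runs once and `t` is a NamedTuple of column vectors.
whole = combine(df, AsTable(:) => (t -> t.a isa AbstractVector) => :a_is_vector)

# With ByRow the function runs once per row and `t` is a NamedTuple of scalars.
per_row = combine(df, AsTable(:) => ByRow(t -> t.a isa AbstractVector) => :a_is_vector)
```

Here whole should have a single row holding true, while per_row should have one false per input row.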

This is definitely confusing, but there’s always the ability to define your own function

julia> df = DataFrame(a = [5, 6], b = [7, 8]);

julia> function maprowsdf(f, df)
           map(eachrow(df)) do r
               nt = NamedTuple(r)
               res = f(; nt...)
               merge(nt, res)
           end |> DataFrame
       end

julia> maprowsdf(foo, df)
2×4 DataFrame
 Row │ a      b      c      d
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     5      7     12     35
   2 │     6      8     14     48

Though I think transform with ByRow is still better.


When do I want AsTable vs AsTable(:)? Is there documentation on this?

The most complete docs are probably here. But in general you want AsTable(:) in the src and AsTable in the dest for a src => fun => dest command.
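
As a rough illustration of the two roles (a sketch only; the table and the output names s and p are invented): in the src, AsTable wraps a column selector and controls what the function receives, and in the dest, a bare AsTable tells transform to expand the returned NamedTuple into columns.

```julia
using DataFrames

df = DataFrame(a = [5, 6], b = [7, 8], C = [1, 2])

# Source: AsTable([:a, :b]) passes only columns a and b, as a NamedTuple per row.
# Destination: AsTable expands the returned NamedTuple into the columns s and p.
out = transform(df, AsTable([:a, :b]) => ByRow(t -> (s = t.a + t.b, p = t.a * t.b)) => AsTable)
```

AsTable(:) is the same idea with the selector : meaning all columns.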


Thanks so much for your help, Peter. The Underscores package helps with readability as well.
I like both of these options.

transform(D, @_ AsTable(:) => f(;_...) |> ByRow => AsTable)

@_ eachrow(D) .|> merge(_,f(;_...)) |> DataFrame

Can AsTable(:) work within the @rtransform macro?
Something like this?

myfunc(; A, B) = A + B

@chain begin
    @rtransform :X = myfunc( AsTable(:)... )