Apply function By Row without re-stating column names

Thanks xiaodai.
Personally, I don’t create multiple versions of the same function. But I know it's a much more important feature than what I’m suggesting. In any case, I can achieve the result with:

@pipe eachrow(D) .|> (; _..., f(; _... )... ) |> DataFrame

It's just not readable.

What if it just gave an error when multiple versions of f are defined?
To use the feature, the programmer would restrict themselves to a single version, which is not a bad trade-off.

You were really close there, just missing a semicolon:

julia> using DataFrames

julia> function foo(;a = 1, b=1, kwargs...)
           (c = a + b, d = a * b)
       end;

julia> df = DataFrame(a = [5], b = [6]);

julia> transform(df, AsTable(:) => ByRow(t -> foo(; t...)) => AsTable)
1×4 DataFrame
 Row │ a      b      c      d     
     │ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 │     5      6     11     30

One note on the syntax: in AsTable(:) => fun => ..., fun does not get an AbstractDataFrame. Rather, it gets a NamedTuple of vectors. And in AsTable(:) => ByRow(fun) => ..., fun gets a NamedTuple of scalars, one per row.

The reason is that we want src => fun => dest to be very performant and type stable, which isn't possible if we just pass a DataFrame to fun.
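
To see the difference concretely, here is a minimal sketch that checks what the function receives in each form. The data and the output column names nt_of_vectors and nt_of_scalars are made up for illustration.

```julia
using DataFrames

df = DataFrame(a = [5, 6], b = [7, 8])

# Whole-column form: the function is called once and sees a NamedTuple of vectors.
cols = combine(df, AsTable(:) => (t -> t isa NamedTuple && t.a isa AbstractVector) => :nt_of_vectors)

# ByRow form: the function is called once per row and sees a NamedTuple of scalars.
rows = combine(df, AsTable(:) => ByRow(t -> t isa NamedTuple && t.a isa Int) => :nt_of_scalars)
```

Both checks should come back true, with one result for the whole-column call and one result per row for the ByRow call.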

This is definitely confusing, but there’s always the ability to define your own function:

julia> df = DataFrame(a = [5, 6], b = [7, 8]);

julia> function maprowsdf(f, df)
           map(eachrow(df)) do r
               nt = NamedTuple(r)
               res = f(; nt...)
               merge(nt, res)
           end |> DataFrame
       end;

julia> maprowsdf(foo, df)
2×4 DataFrame
 Row │ a      b      c      d     
     │ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 │     5      7     12     35
   2 │     6      8     14     48

Though I think transform with ByRow is still better.

When do I want AsTable vs AsTable(:)? Is there documentation on this?

The most complete docs are probably here. But in general you want AsTable(:) in the src and AsTable in the dest for a src => fun => dest command.
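
As a rough illustration of that rule of thumb (the data and the output names s and p are invented for this sketch), AsTable(:) on the source side packs columns into a NamedTuple, and a bare AsTable on the destination side spreads the returned NamedTuple's fields into new columns:

```julia
using DataFrames

df = DataFrame(a = [5, 6], b = [7, 8])

# src: AsTable(:) hands each row to the function as a NamedTuple.
# dest: AsTable turns the fields of the returned NamedTuple into columns.
out = transform(df, AsTable(:) => ByRow(t -> (s = t.a + t.b, p = t.a * t.b)) => AsTable)

# The source selector does not have to be `:`; any column selector works.
out2 = transform(df, AsTable([:a, :b]) => ByRow(t -> (s = t.a + t.b,)) => AsTable)
```

Here out should keep a and b and gain s and p, while out2 gains only s.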

Thanks so much for your help, Peter. The Underscores package helps with readability as well.
I like both of these options:

transform(D, @_ AsTable(:) => f(;_...) |> ByRow => AsTable)

@_ eachrow(D) .|> merge(_,f(;_...)) |> DataFrame

Can AsTable(:) work within the @rtransform macro?
Something like this?

myfunc(; A, B) = A + B

@chain begin
    DataFrame(A=1:2,B=2:3)
    @rtransform :X = myfunc(; AsTable(:)... )
end

Just a slight variation of what is already present in this very interesting discussion:


function f2(t) 
    x = t.a + t.b    
    y = t.a - t.b
    (;x,y, t...)
end


select(D, AsTable(:)=>ByRow(f2)=>AsTable)

Use this variant if you need speed:

DataFrame(map(f2, Tables.namedtupleiterator(D)))
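
A self-contained sketch of the two variants side by side follows. D is not defined in the post above, so toy data is assumed here, and import Tables assumes Tables.jl is available in the active environment (it is a dependency of DataFrames, but may need to be added explicitly). No timing is shown, so the speed difference is not demonstrated by this snippet.

```julia
using DataFrames
import Tables

# Toy data; the original post does not show how D was built.
D = DataFrame(a = [5, 6], b = [7, 8])

function f2(t)
    x = t.a + t.b
    y = t.a - t.b
    (; x, y, t...)   # new fields first, then the original row fields
end

# Variant 1: let DataFrames call f2 on each row.
via_select = select(D, AsTable(:) => ByRow(f2) => AsTable)

# Variant 2: iterate rows as NamedTuples and rebuild the data frame.
via_iterator = DataFrame(map(f2, Tables.namedtupleiterator(D)))
```

Both variants should produce the same columns in the order x, y, a, b.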

This works on master, but we have not made a release yet with that update. I wanted to add one more feature, support for keyword arguments, but that is taking longer than I thought.

This is what I love about Julia. Always a new feature I can’t wait to see :slight_smile:

What about an @rtransform equivalent of => AsTable?
The columns added or altered would then be given by the items of the named tuple returned by the function.

function myfunc(; a, b, kwargs... )
    x = a+b
    y = a-b
    (;x, y)
end

@chain begin
    DataFrame( a=1:2, b=3:4 )

    @rtransform :AsTable = myfunc(; AsTable(:)... )
end

That exists! Just do

@rtransform $AsTable = myfunc(...)

It’s insufficiently documented, which is changing in the future. But there is an example in the docs here.

very cool, thank you :slight_smile:

I notice you can’t use $AsTable = within an @rtransform @astable block.

This would be a nice feature.

Can you clarify what that would look like? Do you mean merging selected names and the programmatically generated names from $AsTable? Fwiw I think this would be a very hard feature to implement, so I’m interested to hear your use case and proposal.

In the example below, column B is added directly, then myfunc with $AsTable adds columns C and D, then column E is added directly.

In my actual code, 5 or 10 columns are added each time. Because $AsTable doesn’t work within an @rtransform @astable block, I have to do it as 3 separate @rtransform blocks.

I’d prefer all the row level operations on the table to be within a single @rtransform block.

function myfunc(; A, B )
    C = A + B
    D = A - B
    (; C, D)
end

@chain begin
    DataFrame( A=1:2 )

    @rtransform @astable begin
        :B = :A + 1      
        $AsTable = myfunc(; AsTable... )
        :E = :C + :D
    end
end

A more extreme idea (probably too difficult): a block similar to @chain that allows row-level operations (as if you were within @rtransform @astable), except where other macros (e.g. @rsubset, @orderby) are being used.

@ChainWithrTransform DataFrame( A = 1:10 ) begin

    :B =  mod(:A,3)
    :E =  :B * 2

    @rsubset :B == 1
    
    :F =  :B * 2

    @orderby :A


    $AsTable = myfunc(; AsTable... )

    :G =  :D * 2

end

Maybe. Then this new @chain-like macro would have to live in DataFramesMeta.jl, rather than Chain.jl.

Another consideration is what :B = mod(:A, 3) would do. It seems costly to copy a new data frame every time; maybe it could do @rtransform! behind the scenes?

Can you please file an issue?
