I’d like to apply a function to each row of a DataFrame and have it add columns named after the function's outputs. I could use transform with ByRow, but I’d like to avoid restating the column names on input and output. Is there anything more elegant than the below?
using DataFrames, Pipe  # @pipe comes from Pipe.jl

function f(; a=0, b=0, _...)
    x = a + b
    y = a - b
    (; x, y)
end

D = DataFrame(a=1:5, b=11:15, C=1:5)
@pipe eachrow(D) .|> (; _..., f(; _...)...) |> DataFrame
“AsTable” is really cool, thank you.
It automatically maps the output column names.
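For reference, here is a minimal sketch of how AsTable can be used on both the source and destination side, reusing f and D from the first post. The anonymous wrapper nt -> f(; nt...) is my own addition, not something from the replies; it works here only because f accepts extra keywords through the trailing _... slurp.

```julia
using DataFrames

function f(; a=0, b=0, _...)
    x = a + b
    y = a - b
    (; x, y)
end

D = DataFrame(a = 1:5, b = 11:15, C = 1:5)

# AsTable(:) hands each row to the function as a NamedTuple of all columns;
# the trailing AsTable expands the returned NamedTuple into new columns.
out = transform(D, AsTable(:) => ByRow(nt -> f(; nt...)) => AsTable)
```

Neither the input nor the output column names are spelled out, and f can still be called directly, e.g. f(a = 1, b = 11).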
This example still restates the input columns though. In practice I have a CSV file or DB table with 30 columns. My function uses 10 of them (by-name) and adds another 10 to the output table.
I’d like to avoid naming the 10 columns - provided the function parameter names match the table field names I’d like it to match them automatically.
I am thinking really hard about how to avoid the input names. But it’s hard because you can have multiple definitions of f, possibly with different argument names! So on the surface, I don’t think it’s possible in general, nor a good idea.
But this requires positional arguments: function f(a, b)
I’d like keyword (named) arguments: function f(; a, b, _...)
Within the 30 column table, the function might use columns 1, 3, 7, 10, 11, 19, 20, 24, 29
I’d like these automatically matched on name.
Though I take xiaodai’s point. If you have multiple versions of a function with different arguments it would be confusing.
The function might have 100 lines. a + b is much clearer than df.a .+ df.b especially when there are 10 terms in the equation.
Within the function I’d like to think purely in terms of the variables (not the data-frame columns)
I’d like to be able to call the function directly as well as apply it to the DataFrame.
Thanks xiaodai.
Personally, I don’t create multiple versions of the same function. But I know it’s a much more important feature than what I’m suggesting. In any case, I can achieve the result with:
What if it just gave an error when multiple versions of f are defined?
Then, to use the feature, the programmer restricts themselves to a single version. Not a bad trade-off.
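The "error if there is more than one method" idea could be sketched roughly as follows. The helper name transform_by_kwargs and the function g are hypothetical, and Base.kwarg_decl is an internal, unexported Julia function, so treat this as an assumption-laden sketch that may break between Julia versions rather than a supported approach.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): pass only the columns whose
# names match f's keyword arguments, and refuse to guess if f has several methods.
function transform_by_kwargs(f, df::AbstractDataFrame)
    ms = methods(f)
    length(ms) == 1 || error("expected exactly one method of f, found $(length(ms))")
    # Internal API: returns the keyword argument names declared by the method.
    kws = Base.kwarg_decl(only(ms))
    # Keep only the columns that f names explicitly, in table order.
    cols = intersect(propertynames(df), kws)
    transform(df, AsTable(cols) => ByRow(nt -> f(; nt...)) => AsTable)
end

# No trailing slurp needed, because only matching columns are passed in.
g(; a, b) = (x = a + b, y = a - b)

D = DataFrame(a = 1:5, b = 11:15, C = 1:5)
out = transform_by_kwargs(g, D)
```

Because only the matching columns are forwarded, the function does not need the _... slurp, and it stays callable on its own as g(a = 1, b = 11).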
Just to note: with the syntax AsTable(:) => fun => ..., fun does not get an AbstractDataFrame. Rather, it gets a NamedTuple of vectors. And in AsTable(:) => ByRow(fun) => ..., fun gets a NamedTuple of scalars, one per row.
The reason is that we want src => fun => dest to be very performant and type stable, which it can’t be if we just pass a DataFrame to fun.
This is definitely confusing, but there’s always the ability to define your own function:
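A small sketch that checks what the function actually receives in each case. The column name ok and the anonymous functions are made up for illustration.

```julia
using DataFrames

df = DataFrame(a = [5, 6], b = [7, 8])

# Without ByRow: called once, with a NamedTuple whose fields are whole column vectors.
whole = combine(df, AsTable(:) => (nt -> nt isa NamedTuple && nt.a isa AbstractVector) => :ok)

# With ByRow: called once per row, with a NamedTuple whose fields are scalars.
perrow = combine(df, AsTable(:) => ByRow(nt -> nt isa NamedTuple && nt.a isa Integer) => :ok)
```

Both checks come back true: a single value in whole.ok, and one value per row in perrow.ok.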
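julia> df = DataFrame(a = [5, 6], b = [7, 8]);

julia> foo(; a, b) = (c = a + b, d = a * b);  # assumed definition, inferred from the output below

julia> function maprowsdf(f, df)
           map(eachrow(df)) do r
               nt = NamedTuple(r)   # row as a NamedTuple
               res = f(; nt...)     # column names matched to keyword arguments
               merge(nt, res)       # original columns plus the new ones
           end |> DataFrame
       end;

julia> maprowsdf(foo, df)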
2×4 DataFrame
 Row │ a      b      c      d
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     5      7     12     35
   2 │     6      8     14     48
Though I think transform with ByRow is still better.