I can apply a function to each column by broadcasting it. But I find the syntax of
@chain df begin
transform([:a, :b, :c] .=> (x->fn.(x)) .=> [:a, :b, :c])
end
clunky. I don’t like (x->fn.(x)) in particular, as I feel it’s somewhat inelegant. Just looking to see if there are better options.
Full MWE:
using Chain, DataFrames
df = DataFrame(a = 1:3, b=1:3, c=1:3, d = ["a", "b", "c"])
fn(x) = 2x
@chain df begin
transform([:a, :b, :c] .=> (x->fn.(x)) .=> [:a, :b, :c])
end
This is the point of introducing ByRow:
julia> transform(df, [:a, :b, :c] .=> ByRow(fn) .=> [:a, :b, :c])
3×4 DataFrame
Row │ a b c d
│ Int64 Int64 Int64 String
─────┼─────────────────────────────
1 │ 2 2 2 a
2 │ 4 4 4 b
3 │ 6 6 6 c
julia> transform(df, [:a, :b, :c] .=> ByRow(fn), renamecols=false)
3×4 DataFrame
Row │ a b c d
│ Int64 Int64 Int64 String
─────┼─────────────────────────────
1 │ 2 2 2 a
2 │ 4 4 4 b
3 │ 6 6 6 c
(As a side benefit, using ByRow reduces compilation latency.)
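As a quick check that the forms discussed so far agree, here is a minimal sketch using the MWE data. The variable names `cols` and `r1` to `r4` are only illustrative, and the last variant, which selects the numeric columns with `names(df, Number)`, is an addition not mentioned above.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 1:3, c = 1:3, d = ["a", "b", "c"])
fn(x) = 2x
cols = [:a, :b, :c]   # illustrative name for the columns to transform

# Original form: an anonymous function that broadcasts fn over each column
r1 = transform(df, cols .=> (x -> fn.(x)) .=> cols)

# ByRow with explicit target column names
r2 = transform(df, cols .=> ByRow(fn) .=> cols)

# ByRow, keeping the source column names instead of generating new ones
r3 = transform(df, cols .=> ByRow(fn), renamecols = false)

# Assumption for illustration: pick the columns by element type rather than by name
r4 = transform(df, names(df, Number) .=> ByRow(fn), renamecols = false)

all_equal = r1 == r2 && r2 == r3 && r3 == r4
```

Inside `@chain df begin ... end` the same pairs can be passed to `transform` without the leading `df` argument.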
I was a little confused by ByRow. Does it have inefficiencies, or should I use it? For a while I was unsure what “ByRow” means.
sijo, May 22, 2021, 12:12pm
To do it for all columns you can simply use mapcols.
julia> df = DataFrame(a = 1:3, b=1:3, c=1:3);
julia> fn(x) = 2x;
julia> mapcols(fn, df)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 2 2
2 │ 4 4 4
3 │ 6 6 6
(Here I had to remove the d column since fn cannot be applied to it. I guess you’re aware of mapcols and added the d column on purpose, but I thought it’d be nice to mention mapcols for others reading this thread.)
sijo:
for all columns
I just wanted to do it for some columns.
This is not the same as what @xiaodai wanted, as in this case fn is not broadcasted.
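The difference becomes visible with a function that behaves differently on a whole column than on a single element. The sketch below uses `sum` purely as an illustration (it is not part of the original question): `mapcols` passes each column as a vector, while wrapping the function in `ByRow` applies it element by element.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 1:3, c = 1:3)

# mapcols hands each whole column to the function,
# so sum collapses every column to a single value (one row)
whole_column = mapcols(sum, df)

# With ByRow the function sees one element at a time,
# and the sum of a single number is the number itself
element_wise = mapcols(ByRow(sum), df)

size(whole_column), size(element_wise)
```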
Simple answer: ByRow(f)(x...) is the same as f.(x...).
Complex answer:
it is the same in most cases, provided you pass vectors as arguments;
however, internally we do not use broadcasting, because broadcasting is expensive to compile;
additionally, if x is a NamedTuple that is a Tables.jl table, we use a slightly different rule (preserving column names for use inside the function).
The exact rules are super simple:
(f::ByRow)(cols::AbstractVector...) = map(f.fun, cols...)
(f::ByRow)(table::NamedTuple) = [f.fun(nt) for nt in Tables.namedtupleiterator(table)]
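Both rules can be tried directly, outside of `transform`. This is a small sketch under the assumption that a `ByRow` object can be called on plain vectors and on a NamedTuple of equal-length vectors, as the two methods quoted above suggest; the functions `f` and `g` and the data are made up for illustration.

```julia
using DataFrames

f(x, y) = x + 2y
x = [1, 2, 3]
y = [10, 20, 30]

# Vector arguments: ByRow gives the same result as broadcasting or map
via_byrow     = ByRow(f)(x, y)
via_broadcast = f.(x, y)
via_map       = map(f, x, y)

# NamedTuple of columns: the function receives each row as a NamedTuple,
# so the column names are available inside it
table = (a = [1, 2, 3], b = [10, 20, 30])
g(row) = row.a + row.b
row_sums = ByRow(g)(table)
```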
sijo, May 22, 2021, 2:52pm
Ah, good point: it only works in this example because fn(x) = 2x and scalar multiplication on columns is equivalent to multiplying row by row.
Yes. Still, I think ByRow is a little easier to read in:
julia> using DataFrames
julia> df = DataFrame(rand(3,4), :auto)
3×4 DataFrame
Row │ x1 x2 x3 x4
│ Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────
1 │ 0.798251 0.788097 0.12371 0.447479
2 │ 0.320884 0.561217 0.736315 0.113512
3 │ 0.691074 0.807812 0.31742 0.00395885
julia> mapcols(ByRow(sin), df)
3×4 DataFrame
Row │ x1 x2 x3 x4
│ Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────
1 │ 0.716137 0.709013 0.123395 0.432694
2 │ 0.315405 0.532217 0.671562 0.113268
3 │ 0.637365 0.722777 0.312116 0.00395884
julia> mapcols(x -> sin.(x), df)
3×4 DataFrame
Row │ x1 x2 x3 x4
│ Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────
1 │ 0.716137 0.709013 0.123395 0.432694
2 │ 0.315405 0.532217 0.671562 0.113268
3 │ 0.637365 0.722777 0.312116 0.00395884
per your proposal
It will be fast, as fast as map. The only exception is with AsTable, in which case there might be some overhead because it acts on a NamedTupleIterator.
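To make the two code paths concrete, here is a sketch that computes the same row-wise total both ways. The column name `:total` and the data are made up for illustration, and no timings are claimed here; measuring the overhead on your own data (for example with BenchmarkTools) would be the way to judge it.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 4:6, c = 7:9)

# Columns passed as separate positional arguments: ByRow maps over the vectors
positional = transform(df, [:a, :b, :c] => ByRow(+) => :total)

# Columns passed via AsTable: the function receives each row as a NamedTuple
as_table = transform(df, AsTable([:a, :b, :c]) => ByRow(nt -> nt.a + nt.b + nt.c) => :total)

positional == as_table
```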