Elegant ways to broadcast the same function to each column replacing the original column in DataFrames.jl

I can apply a function to each column by broadcasting it. But I find the syntax of

@chain df begin
    transform([:a, :b, :c]  .=> (x->fn.(x)) .=> [:a, :b, :c])
end

clunky. I don’t like (x->fn.(x)) in particular, as I feel it’s somewhat inelegant. Just looking to see if there are better options.

Full MWE:

using Chain, DataFrames
df = DataFrame(a = 1:3, b = 1:3, c = 1:3, d = ["a", "b", "c"])

fn(x) = 2x

@chain df begin
    transform([:a, :b, :c]  .=> (x->fn.(x)) .=> [:a, :b, :c])
end

This is the point of introducing ByRow:

julia> transform(df, [:a, :b, :c]  .=> ByRow(fn) .=> [:a, :b, :c])
3×4 DataFrame
 Row │ a      b      c      d      
     │ Int64  Int64  Int64  String 
─────┼─────────────────────────────
   1 │     2      2      2  a
   2 │     4      4      4  b
   3 │     6      6      6  c

julia> transform(df, [:a, :b, :c]  .=> ByRow(fn), renamecols=false)
3×4 DataFrame
 Row │ a      b      c      d      
     │ Int64  Int64  Int64  String 
─────┼─────────────────────────────
   1 │     2      2      2  a
   2 │     4      4      4  b
   3 │     6      6      6  c

(As a side benefit, using ByRow reduces compilation latency.)
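
If the goal is every numeric column rather than a hand-written list, the selector can be computed instead of typed out. A minimal sketch, assuming the MWE from the question: names(df, Number) picks columns by element type, and num_cols and out are only illustrative names.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 1:3, c = 1:3, d = ["a", "b", "c"])
fn(x) = 2x

# names(df, Number) returns the names of columns whose element type is a Number,
# so the String column d is left out of the selection
num_cols = names(df, Number)

# Apply fn elementwise to the selected columns and keep their original names
out = transform(df, num_cols .=> ByRow(fn), renamecols = false)
```

Column d passes through unchanged because transform keeps all columns that are not overwritten.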


I was a little confused by ByRow. Does it have inefficiencies, or should I use it? For a while I was not sure what “ByRow” actually means.

To do it for all columns you can simply use mapcols.

julia> df = DataFrame(a = 1:3, b=1:3, c=1:3);

julia> fn(x) = 2x;

julia> mapcols(fn, df)
3×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     2      2      2
   2 │     4      4      4
   3 │     6      6      6

(Here I had to remove the d column since fn cannot be applied to it. I guess you’re aware of mapcols and added the d column on purpose, but I thought it’d be nice to mention mapcols for others reading this thread).

I just wanted to do it for some of the columns.

This is not the same as what @xiaodai wanted, because in this case fn is not broadcast.


Simple answer: ByRow(f)(x...) is the same as f.(x...).

Complex answer:

  • it is the same in most cases, provided you pass vectors as arguments;
  • however, internally we do not use broadcasting, because broadcasting is expensive to compile;
  • additionally, if x is a NamedTuple that is a Tables.jl table, we use a slightly different rule (preserving column names for use inside the function).

The exact rules are super simple:

(f::ByRow)(cols::AbstractVector...) = map(f.fun, cols...)
(f::ByRow)(table::NamedTuple) = [f.fun(nt) for nt in Tables.namedtupleiterator(table)]
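
Both rules can be tried directly by calling a ByRow object outside of transform. This is only a small sketch; fn, v, tbl and the r1 to r3 names are illustrative.

```julia
using DataFrames

fn(x) = 2x
v = [1, 2, 3]

# One vector: ByRow(fn)(v) gives the same values as broadcasting fn.(v)
r1 = ByRow(fn)(v)

# Several vectors: elements are passed to the function positionally
r2 = ByRow(+)([1, 2], [10, 20])

# NamedTuple of columns: each row reaches the function as a NamedTuple,
# so fields can be accessed by column name
tbl = (a = [1, 2], b = [10, 20])
r3 = ByRow(nt -> nt.a + nt.b)(tbl)
```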

Ah good point, it only works in this example because fn(x) = 2x and scalar multiplication on columns is equivalent to multiplying row by row.
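
The difference shows up with a function that is defined for numbers but not for whole vectors. A minimal sketch using abs as a stand-in for such a function (the data frame and the res and threw names are made up for illustration):

```julia
using DataFrames

df = DataFrame(a = [-1, 2, -3], b = [4, -5, 6])

# abs is defined for numbers, not for whole vectors, so wrap it in ByRow
res = mapcols(ByRow(abs), df)

# Passing abs directly hands it each whole column, which throws an error
threw = try
    mapcols(abs, df)
    false
catch
    true
end
```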


Yes. However, I still think ByRow is a little easier to read in:

julia> using DataFrames

julia> df = DataFrame(rand(3,4), :auto)
3×4 DataFrame
 Row │ x1        x2        x3        x4
     │ Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────
   1 │ 0.798251  0.788097  0.12371   0.447479
   2 │ 0.320884  0.561217  0.736315  0.113512
   3 │ 0.691074  0.807812  0.31742   0.00395885

julia> mapcols(ByRow(sin), df)
3×4 DataFrame
 Row │ x1        x2        x3        x4
     │ Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────
   1 │ 0.716137  0.709013  0.123395  0.432694
   2 │ 0.315405  0.532217  0.671562  0.113268
   3 │ 0.637365  0.722777  0.312116  0.00395884

julia> mapcols(x -> sin.(x), df)
3×4 DataFrame
 Row │ x1        x2        x3        x4
     │ Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────
   1 │ 0.716137  0.709013  0.123395  0.432694
   2 │ 0.315405  0.532217  0.671562  0.113268
   3 │ 0.637365  0.722777  0.312116  0.00395884

per your proposal


ByRow will be fast, as fast as map. The only exception is with AsTable: because it then acts on a NamedTupleIterator, there might be some overhead.
