Elegant ways to broadcast the same function to each column replacing the original column in DataFrames.jl

xiaodai · May 22, 2021, 10:44am

I can apply a function to each column by broadcasting it, but I find the syntax of

@chain df begin
    transform([:a, :b, :c]  .=> (x->fn.(x)) .=> [:a, :b, :c])
end

clunky. I don’t like (x->fn.(x)) in particular, as it feels somewhat inelegant. Just looking to see if there are better options.

Full MWE:

using Chain, DataFrames
df = DataFrame(a = 1:3, b = 1:3, c = 1:3, d = ["a", "b", "c"])

fn(x) = 2x

@chain df begin
    transform([:a, :b, :c]  .=> (x->fn.(x)) .=> [:a, :b, :c])
end

bkamins · May 22, 2021, 10:46am

This is the point of introducing ByRow:

julia> transform(df, [:a, :b, :c]  .=> ByRow(fn) .=> [:a, :b, :c])
3×4 DataFrame
 Row │ a      b      c      d      
     │ Int64  Int64  Int64  String 
─────┼─────────────────────────────
   1 │     2      2      2  a
   2 │     4      4      4  b
   3 │     6      6      6  c

julia> transform(df, [:a, :b, :c]  .=> ByRow(fn), renamecols=false)
3×4 DataFrame
 Row │ a      b      c      d      
     │ Int64  Int64  Int64  String 
─────┼─────────────────────────────
   1 │     2      2      2  a
   2 │     4      4      4  b
   3 │     6      6      6  c

(As a side benefit, using ByRow reduces compilation latency.)

xiaodai · May 22, 2021, 10:48am

I was a little confused by ByRow for a while, and by what “ByRow” is supposed to mean. Does it have any inefficiencies, or should I just use it?

sijo · May 22, 2021, 12:12pm

To do it for all columns you can simply use mapcols.

julia> df = DataFrame(a = 1:3, b=1:3, c=1:3);

julia> fn(x) = 2x;

julia> mapcols(fn, df)
3×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     2      2      2
   2 │     4      4      4
   3 │     6      6      6

(Here I had to remove the d column since fn cannot be applied to it. I guess you’re aware of mapcols and added the d column on purpose, but I thought it’d be nice to mention mapcols for others reading this thread).

xiaodai · May 22, 2021, 12:14pm

I just wanted to do it for some of the columns.

bkamins · May 22, 2021, 1:14pm

This is not the same as what @xiaodai wanted, because in this case fn is not broadcast.
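
For readers following along, here is a minimal sketch of the difference. It assumes a hypothetical function g that is only defined for scalars (g is not from the thread). mapcols passes each whole column to the function, so element-wise application has to be requested explicitly, for example by wrapping the function in ByRow.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 1:3, c = 1:3)

# Hypothetical scalar-only function (an assumption for illustration).
g(x) = x + 1

# mapcols hands g an entire column, and Vector + Int is not defined,
# so this call is expected to fail with a MethodError:
# mapcols(g, df)

# Wrapping g in ByRow applies it element by element instead.
res = mapcols(ByRow(g), df)
```

With fn(x) = 2x the distinction is invisible, since multiplying a vector by a scalar happens to give the same result as multiplying each element.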

bkamins · May 22, 2021, 1:18pm

Simple answer: ByRow(f)(x...) is the same as f.(x...).

Complex answer:

it is the same in most cases, provided you pass vectors as arguments;
however, internally we do not use broadcasting, because broadcasting is expensive to compile;
additionally, if x is a NamedTuple that is a Tables.jl table, we use a slightly different rule (preserving column names for use inside the function).

The exact rules are super simple:

(f::ByRow)(cols::AbstractVector...) = map(f.fun, cols...)
(f::ByRow)(table::NamedTuple) = [f.fun(nt) for nt in Tables.namedtupleiterator(table)]
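
A small sketch of these two methods called directly, outside of transform. The input vectors and the anonymous function are made up for illustration.

```julia
using DataFrames

fn(x) = 2x

# Vector method: ByRow(fn) maps fn over the elements, like fn.(v).
v = [1, 2, 3]
r1 = ByRow(fn)(v)
same = r1 == fn.(v)

# Several vectors are walked in lockstep, like (+).(x, y).
r2 = ByRow(+)([1, 2], [10, 20])

# NamedTuple-of-vectors method: each row arrives as a NamedTuple,
# so the function can refer to columns by name.
tbl = (a = [1, 2], b = [10, 20])
r3 = ByRow(nt -> nt.a + nt.b)(tbl)
```

Because the first method only accepts vectors, mixing in a scalar argument is one case where ByRow(f)(x...) and f.(x...) should not be expected to behave the same.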

sijo · May 22, 2021, 2:52pm

Ah good point, it only works in this example because fn(x) = 2x and scalar multiplication on columns is equivalent to multiplying row by row.

bkamins · May 22, 2021, 3:10pm

Yes, however, I still think ByRow is a little easier to read in:

julia> using DataFrames

julia> df = DataFrame(rand(3,4), :auto)
3×4 DataFrame
 Row │ x1        x2        x3        x4
     │ Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────
   1 │ 0.798251  0.788097  0.12371   0.447479
   2 │ 0.320884  0.561217  0.736315  0.113512
   3 │ 0.691074  0.807812  0.31742   0.00395885

julia> mapcols(ByRow(sin), df)
3×4 DataFrame
 Row │ x1        x2        x3        x4
     │ Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────
   1 │ 0.716137  0.709013  0.123395  0.432694
   2 │ 0.315405  0.532217  0.671562  0.113268
   3 │ 0.637365  0.722777  0.312116  0.00395884

julia> mapcols(x -> sin.(x), df)
3×4 DataFrame
 Row │ x1        x2        x3        x4
     │ Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────
   1 │ 0.716137  0.709013  0.123395  0.432694
   2 │ 0.315405  0.532217  0.671562  0.113268
   3 │ 0.637365  0.722777  0.312116  0.00395884

per your proposal

pdeffebach · May 22, 2021, 4:36pm

It will be fast, as fast as map. The only exception is AsTable: because ByRow then acts on a NamedTupleIterator, there might be some overhead.
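
To make the two cases concrete, here is a rough sketch, assuming a small made-up data frame and a made-up output column name (:total). It only shows the two calling styles and does not measure the overhead.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 4:6)

# Plain column selection: the function receives one scalar per column.
res1 = transform(df, [:a, :b] => ByRow((a, b) -> a + b) => :total)

# AsTable: the function receives each row as a NamedTuple.
res2 = transform(df, AsTable([:a, :b]) => ByRow(nt -> nt.a + nt.b) => :total)
```

Both calls should produce the same :total column. The AsTable form is the one that goes through the NamedTuple iterator.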
