Rewriting dplyr code which uses a function of columns in Julia-style using DataFrames.jl

Adam_Fleischhacker · March 25, 2021, 3:35pm

I am migrating my brain towards thinking in a Julia-esque style. Yet I am not sure how to approach this problem, where I want to create a third column of a data frame that is a function of the other two.
I prefer intuitive/readable code and am wondering whether I should add another method to my function, use broadcasting, use ByRow, or do whatever else would be recommended. The anonymous function syntax that I see seems overly verbose, but I am open to it if that is the way.

## reproducible Example
using DataFrames

## define function of x and y
function foo(x::Float64,y::Float64)
    2*x*y
end

## create data frame with three columns
DataFrame(colX = rand(3),
        colY = rand(3),
        colZ = foo.(:colX,:colY))

# ERROR: LoadError: MethodError: no method matching foo(::Symbol, ::Symbol)

How would you rewrite the above in a way that works and is highly readable? My point of comparison is this R code:

library(dplyr)
foo = function(x,y) {
  2 * x * y
}

tibble(x = runif(3),
       y = runif(3),
       z = foo(x,y))

chris-b1 · March 25, 2021, 3:55pm

With DataFrames.jl, you can’t reference other columns by name during construction, and in general you don’t have the non-standard evaluation that dplyr uses. Two options for writing this would be:

# 1.  make the vectors first
y = rand(3)
x = rand(3)
df = DataFrame(x=x, y=y, z=foo.(x, y))

# 2. add the column after the fact
df = DataFrame(x=rand(3), y=rand(3))
df.z = foo.(df.x, df.y)
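
The `ByRow` wrapper mentioned in the question is a third option that stays within plain DataFrames.jl. The following is a minimal sketch rather than a definitive recommendation: it assumes the `foo` from the original post and the same `x`/`y` column names used above, and the names `df2` and `df3` are only illustrative. `transform` returns a new data frame; `transform!` would modify `df` in place instead.

```julia
using DataFrames

# same function as in the question
foo(x::Float64, y::Float64) = 2 * x * y

df = DataFrame(x = rand(3), y = rand(3))

# 3. use the `source columns => function => new column` form;
#    ByRow(foo) calls foo once per row with scalar arguments
df2 = transform(df, [:x, :y] => ByRow(foo) => :z)

# without ByRow the function receives whole columns,
# so the broadcast has to happen inside an anonymous function
df3 = transform(df, [:x, :y] => ((x, y) -> foo.(x, y)) => :z)
```

Both calls should produce the same `z` column as `foo.(df.x, df.y)`; the `ByRow` form avoids the anonymous function that the question found verbose.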

ElOceanografo · March 25, 2021, 4:45pm

I find that DataFramesMeta gives me the closest experience to dplyr:

using DataFrames, DataFramesMeta
foo(x, y) = 2 * x * y
df = DataFrame(x=randn(10), y=randn(10))
# using the @transform macro 
@transform(df, z = foo.(:x, :y))
# ...or chaining operations using @linq
@linq df |> transform(z = foo.(:x, :y))

pdeffebach · March 25, 2021, 5:17pm

I think DataFramesMeta + Chain.jl gives the most dplyr-like experience.

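julia> using DataFrames, DataFramesMeta, Chain, Statistics  # Statistics provides mean; foo is defined as in the previous post

julia> df = DataFrame(x=randn(10), y=randn(10));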

julia> @chain df begin 
           @transform(y = foo.(:x, :y))
           @transform(g = rand(0:1, length(:y)))
           groupby(:g)
           @combine(y_mean = mean(:y), x_mean = mean(:x))
       end
2×3 DataFrame
 Row │ g      y_mean    x_mean    
     │ Int64  Float64   Float64   
─────┼────────────────────────────
   1 │     0  0.967425  -0.513779
   2 │     1  1.68032   -0.906092
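
For readers on newer DataFramesMeta.jl releases: as far as I understand, versions after this thread expect the new column on the left-hand side to be written as a symbol (`:z = ...`), provide row-wise variants such as `@rtransform`, and re-export `@chain` from Chain.jl. The sketch below rests on those assumptions about the installed version; the `g` grouping column and the `z_mean` name are made up for illustration.

```julia
using DataFrames, DataFramesMeta, Statistics

foo(x, y) = 2 * x * y

# g is a hypothetical grouping column with two levels
df = DataFrame(x = randn(10), y = randn(10), g = repeat(0:1, 5))

out = @chain df begin
    @rtransform :z = foo(:x, :y)   # row-wise, so no broadcasting dot is needed
    groupby(:g)
    @combine :z_mean = mean(:z)    # one row per group
end
```

The row-wise macro keeps `foo` written for scalars, which is close to how the function would be used inside `mutate` in dplyr.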

Adam_Fleischhacker · March 25, 2021, 5:46pm

I like this paradigm the best. Very readable and melds well with my brain.

Adam_Fleischhacker · March 25, 2021, 5:49pm

I do like #2 a lot, but really like the chaining workflow of dplyr as presented below. Thx.
