Rewriting dplyr code that uses a function of columns in Julia style using DataFrames.jl

I am trying to shift my thinking towards a Julia-esque style, but I am not sure how to approach this problem: I want to create a third column of a data frame that is a function of the other two.
I prefer intuitive, readable code, and I am wondering whether I should add another method to my function, use broadcasting, use ByRow, or do something else. The anonymous function syntax that I see seems overly verbose, but I am open to it if that is the recommended way.

## reproducible Example
using DataFrames

## define function of x and y
function foo(x::Float64,y::Float64)
    2 * x * y
end

## create data frame with three columns
DataFrame(colX = rand(3),
        colY = rand(3),
        colZ = foo.(:colX,:colY))

# ERROR: LoadError: MethodError: no method matching foo(::Symbol, ::Symbol)

How would you rewrite the above in a way that works and is highly readable? My point of comparison is this R code:

foo = function(x,y) {
  2 * x * y
}

tibble(x = runif(3),
       y = runif(3),
       z = foo(x,y))

With DataFrames.jl, you can't reference other columns by name during construction, and in general you don't have the non-standard evaluation that dplyr uses. Two options would be:

# 1.  make the vectors first
y = rand(3)
x = rand(3)
df = DataFrame(x=x, y=y, z=foo.(x, y))

# 2. add the column after the fact
df = DataFrame(x=rand(3), y=rand(3))
df.z = foo.(df.x, df.y)
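
Since the question also asks about ByRow, here is a minimal sketch of that route with plain DataFrames.jl. It assumes foo is the 2 * x * y function from the question; the names df2 and z are only illustrative.

```julia
using DataFrames

foo(x, y) = 2 * x * y   # assumed to match the function in the question

df = DataFrame(x = rand(3), y = rand(3))

# source columns => function => new column name
# ByRow(foo) calls foo once per row with that row's x and y
df2 = transform(df, [:x, :y] => ByRow(foo) => :z)

# in-place variant: adds z to df itself
transform!(df, [:x, :y] => ByRow(foo) => :z)
```

This is more verbose than option 2, but it fits into a pipeline because transform returns a new data frame instead of mutating one.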


I find that DataFramesMeta gives me the closest experience to dplyr:

using DataFrames, DataFramesMeta
foo(x, y) = 2 * x * y
df = DataFrame(x=randn(10), y=randn(10))
# using the @transform macro 
@transform(df, z = foo.(:x, :y))
# ...or chaining operations using @linq
@linq df |> transform(z = foo.(:x, :y))
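
A note for anyone reading this later: as far as I know, recent DataFramesMeta releases expect the new column to be written as a symbol (:z = ... rather than z = ...), and @linq has been deprecated in favor of @chain. A rough sketch under that assumption, with df_col and df_row as illustrative names:

```julia
using DataFrames, DataFramesMeta

foo(x, y) = 2 * x * y

df = DataFrame(x = randn(10), y = randn(10))

# column-wise: :x and :y are whole vectors, so foo is broadcast with a dot
df_col = @transform(df, :z = foo.(:x, :y))

# row-wise: @rtransform evaluates the expression once per row, so no dot is needed
df_row = @rtransform(df, :z = foo(:x, :y))
```

The row-wise form is probably the closest match to how mutate reads in dplyr when foo is a scalar function.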

I think DataFramesMeta + Chain.jl gives the most dplyr-like experience:

julia> df = DataFrame(x=randn(10), y=randn(10));

julia> @chain df begin 
           @transform(y = foo.(:x, :y))
           @transform(g = rand(0:1, length(:y)))
           groupby(:g)
           @combine(y_mean = mean(:y), x_mean = mean(:x))
       end
2×3 DataFrame
 Row │ g      y_mean    x_mean    
     │ Int64  Float64   Float64   
─────┼────────────────────────────
   1 │     0  0.967425  -0.513779
   2 │     1  1.68032   -0.906092
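
For reference, a sketch of the same kind of pipeline as a script rather than a REPL session. It assumes a recent DataFramesMeta release (which, to my knowledge, re-exports @chain and expects the :col = ... form), and it assumes the grouped output above comes from a groupby(:g) step. The name result is illustrative, and the numbers will differ from run to run because the data are random.

```julia
using DataFrames, DataFramesMeta
using Statistics: mean

foo(x, y) = 2 * x * y

df = DataFrame(x = randn(10), y = randn(10))

result = @chain df begin
    # overwrite y with foo applied elementwise to x and y
    @transform(:y = foo.(:x, :y))
    # add a random 0/1 grouping column
    @transform(:g = rand(0:1, length(:y)))
    # split by g, then summarise each group
    groupby(:g)
    @combine(:y_mean = mean(:y), :x_mean = mean(:x))
end
```

Each step receives the result of the previous one as its first argument, which is what makes this read like a dplyr pipe.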

I like this paradigm the best. Very readable and melds well with my brain.

I do like #2 a lot, but I really like the chaining workflow of dplyr as presented below. Thanks.