I am trying to shift my thinking toward a Julia-esque style, but I am not sure how to approach this problem: I want to create a third column of a data frame that is a function of the other two.
I prefer intuitive, readable code and am wondering whether I should add another method to my function, use broadcasting, use ByRow, or something else. The anonymous function syntax that I have seen seems overly verbose, but I am open to it if that is the recommended way.
## reproducible Example
using DataFrames
## define function of x and y
function foo(x::Float64,y::Float64)
2*x*y
end
## create data frame with three columns
DataFrame(colX = rand(3),
colY = rand(3),
colZ = foo.(:colX,:colY))
# ERROR: LoadError: MethodError: no method matching foo(::Symbol, ::Symbol)
How would you rewrite the above in a way that works and is highly readable? My point of comparison is this R code:
library(dplyr)
foo = function(x,y) {
2 * x * y
}
tibble(x = runif(3),
y = runif(3),
z = foo(x,y))
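With DataFrames.jl, you can't reference other columns by name during construction, and in general you don't have the non-standard evaluation that dplyr uses. In your example, `:colX` and `:colY` are just Symbols, so `foo.(:colX, :colY)` tries to call `foo(::Symbol, ::Symbol)`. Two ways to write it would be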
# 1. make the vectors first
y = rand(3)
x = rand(3)
df = DataFrame(x=x, y=y, z=foo.(x, y))
# 2. add the column after the fact
df = DataFrame(x=rand(3), y=rand(3))
df.z = foo.(df.x, df.y)
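Since the question also mentions ByRow, here is a minimal sketch of a third option, using `transform` with the `source => function => destination` pair syntax of DataFrames.jl. The fixed input values are only for illustration, and `foo` is written without the `Float64` annotations (my relaxation, not part of the original) so that it accepts any numeric columns.

```julia
using DataFrames

# foo from the question, with the Float64 annotations dropped (illustrative choice)
foo(x, y) = 2 * x * y

# fixed values instead of rand(3) so the result is easy to check by hand
df = DataFrame(x = [1.0, 2.0, 3.0], y = [0.5, 1.0, 1.5])

# 3. transform with ByRow: call foo on each (x, y) pair and store the result in z
df2 = transform(df, [:x, :y] => ByRow(foo) => :z)

# transform! does the same thing but adds the column to df in place
transform!(df, [:x, :y] => ByRow(foo) => :z)
```

`ByRow(foo)` is roughly equivalent to broadcasting `foo` over the selected columns, so this gives the same `z` as `foo.(df.x, df.y)`.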
I find that DataFramesMeta gives me the closest experience to dplyr:
using DataFrames, DataFramesMeta
foo(x, y) = 2 * x * y
df = DataFrame(x=randn(10), y=randn(10))
# using the @transform macro
@transform(df, z = foo.(:x, :y))
# ...or chaining operations using @linq
@linq df |> transform(z = foo.(:x, :y))
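One caveat: the keyword-style `z = ...` and `@linq` shown above match the DataFramesMeta version that was current when this was posted. As far as I know, more recent releases expect the new column to be written as a Symbol on the left-hand side and also offer row-wise macros such as `@rtransform`. A sketch for a newer version (an assumption about what you have installed; the fixed data is only for illustration) might look like:

```julia
using DataFrames, DataFramesMeta

foo(x, y) = 2 * x * y
df = DataFrame(x = [1.0, 2.0, 3.0], y = [0.5, 1.0, 1.5])

# column-wise: :x and :y stand for whole columns, so foo is broadcast with a dot
df_a = @transform(df, :z = foo.(:x, :y))

# row-wise: :x and :y stand for the values in each row, so no dot is needed
df_b = @rtransform(df, :z = foo(:x, :y))
```

Both forms return a new data frame and leave `df` unchanged.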
I think DataFramesMeta + Chain.jl gives the most dplyr-like experience:
julia> using DataFrames, DataFramesMeta, Chain, Statistics  # Statistics provides mean
julia> df = DataFrame(x=randn(10), y=randn(10));
julia> @chain df begin
@transform(y = foo.(:x, :y))
@transform(g = rand(0:1, length(:y)))
groupby(:g)
@combine(y_mean = mean(:y), x_mean = mean(:x))
end
2×3 DataFrame
 Row │ g      y_mean    x_mean
     │ Int64  Float64   Float64
─────┼────────────────────────────
   1 │     0  0.967425  -0.513779
   2 │     1  1.68032   -0.906092
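For comparison, here is a rough sketch of the same pipeline in plain DataFrames.jl, without macros. The grouping column `g` and the input values are fixed here (my assumption, instead of `rand`) so that the result is reproducible.

```julia
using DataFrames, Statistics

foo(x, y) = 2 * x * y

# fixed illustrative data; g plays the role of the random 0/1 group above
df = DataFrame(x = [1.0, 2.0, 3.0, 4.0],
               y = [1.0, 1.0, 2.0, 2.0],
               g = [0, 1, 0, 1])

# overwrite y with foo applied row by row
df2 = transform(df, [:x, :y] => ByRow(foo) => :y)

# group by g and take the mean of y and x within each group
res = combine(groupby(df2, :g),
              :y => mean => :y_mean,
              :x => mean => :x_mean)
```

The macro version reads more like dplyr, but it does much the same work as these `transform`, `groupby`, and `combine` calls.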
I like this paradigm the best. Very readable and melds well with my brain.
I do like #2 a lot, but really like the chaining workflow of dplyr as presented below. Thanks.