Suppose I have the following DataFrame:
test = DataFrame(
w = [10, 13, 5],
a1 = [12, 9, 6],
a2 = [9, 10, 8],
a3 = [14, 10, 12]
)
3×4 DataFrame
Row │ w a1 a2 a3
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 10 12 9 14
2 │ 13 9 10 10
3 │ 5 6 8 12
I would like to create a new column v
that is the sum of the squared differences between all of the “a” columns (i.e., a1 - a3) and w
, but I want to refer to the “a” columns programmatically.
I’ve tried several variations that go something like this:
a_cols = [Symbol("a$i") for i in 1:3]
3-element Vector{Symbol}:
:a1
:a2
:a3
@combine(test,
:total = sum(:w),
@byrow :se = sum([(AsTable(a_cols) .- :w)^2])
)
and I get:
ERROR: ArgumentError: Column references must be either all the same type or a a combination of `Symbol`s and strings
Stacktrace:
[1] make_source_concrete(x::Vector{Any})
@ DataFramesMeta C:\Users\mthel\.julia\packages\DataFramesMeta\BkRtJ\src\parsing.jl:398
[2] top-level scope
@ C:\Users\mthel\.julia\packages\DataFramesMeta\BkRtJ\src\macros.jl:1944
The desired output would be:
3×5 DataFrame
Row │ w a1 a2 a3 v
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 10 12 9 14 21
2 │ 13 9 10 10 34
3 │ 5 6 8 12 59
Maybe this:
transform(test, All() => ByRow((w,as...) -> sum((a-w)^2 for a ∈ as)) => :v)
Well, it doesn’t throw an error but it doesn’t produce the desired result unfortunately:
3×5 DataFrame
Row │ w a1 a2 a3 v
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 10 12 9 14 142
2 │ 13 9 10 10 475
3 │ 5 6 8 12 2975
bkamins
November 30, 2022, 4:56pm
4
For me it works correctly:
julia> transform(test, All() => ByRow((w,as...) -> sum((a-w)^2 for a ∈ as)) => :v)
3×5 DataFrame
Row │ w a1 a2 a3 v
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 10 12 9 14 21
2 │ 13 9 10 10 34
3 │ 5 6 8 12 59
(and it looks correct)
Maybe a more careful code would be e.g.
julia> transform(test, Cols(:w, r"a\d") => ByRow((w,as...) -> sum((a-w)^2 for a ∈ as)) => :v)
3×5 DataFrame
Row │ w a1 a2 a3 v
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 10 12 9 14 21
2 │ 13 9 10 10 34
3 │ 5 6 8 12 59
but it is a detail.
Regarding DataFramesMeta.jl - I think it is not possible currently there, but @pdeffebach is working on it.
On the other hand DataFrameMacros.jl should support this I think - @jules should know best.
1 Like
The second version works great. Thanks to you both!
A simple way:
df.v = sum((df[:,a] - df.w).^2 for a in Symbol.("a", 1:3))
NB: renamed test
, df
2 Likes
This is a good point. Here is a less elegant solution with DataFramesMeta
julia> @combine(test,
:total = sum(:w),
@byrow :se = begin
nt = AsTable([a_cols; :w])
v_a = collect(nt[a_cols])
w = nt.w
sum( (v_a .- w) .^2 )
end
)
Dan
November 30, 2022, 6:11pm
8
Another simple solution (good mostly for this example, StatsBase needed):
test.v = 3var(Matrix(test); mean=test.w, dims=2)
(I would also assume that the variance is actually needed and then the 3
in front isn’t needed).
Does All()
have a guarantee on the order of variables that are selected? I was assuming it was 1:ncol(df)
, but it looks more complicated than that in the source.
bkamins
November 30, 2022, 6:22pm
10
yes All()
produces the same order as names(test)
.
1 Like
jules
November 30, 2022, 7:27pm
11
@bkamins thanks for the tag, indeed this is something that DataFrameMacros can do via {{ }}
syntax . This makes a tuple of the referenced columns so we can do:
using DataFrameMacros
using DataFrames
test = DataFrame(
w = [10, 13, 5],
a1 = [12, 9, 6],
a2 = [9, 10, 8],
a3 = [14, 10, 12]
)
@transform test :v = sum((:w .- {{r"a"}}) .^ 2)
3×5 DataFrame
Row │ w a1 a2 a3 v
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 10 12 9 14 21
2 │ 13 9 10 10 34
3 │ 5 6 8 12 59
The contrast to { }
specifiers is that those broadcast the whole expression, so if there were more w
s one could do:
test2 = DataFrame(
w1 = [10, 13, 5],
w2 = [13, 15, 7],
w3 = [9, 3, 10],
a1 = [12, 9, 6],
a2 = [9, 10, 8],
a3 = [14, 10, 12]
)
@transform test2 "v_{}" = sum(({r"w"} .- {{r"a"}}) .^ 2)
3×9 DataFrame
Row │ w1 w2 w3 a1 a2 a3 v_w1 v_w2 v_w3
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────────────────────────────────
1 │ 10 13 9 12 9 14 21 18 34
2 │ 13 15 3 9 10 10 34 86 134
3 │ 5 7 10 6 8 12 59 27 24
3 Likes