Working with multiple columns in DataFramesMeta, programmatically

Suppose I have the following DataFrame:

test = DataFrame(
    w = [10, 13, 5],
    a1 = [12, 9, 6],
    a2 = [9, 10, 8],
    a3 = [14, 10, 12]
)

3×4 DataFrame
 Row │ w      a1     a2     a3    
     │ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 │    10     12      9     14
   2 │    13      9     10     10
   3 │     5      6      8     12

I would like to create a new column v that is the sum of the squared differences between all of the “a” columns (i.e., a1 - a3) and w, but I want to refer to the “a” columns programmatically.

I’ve tried several variations that go something like this:

a_cols = [Symbol("a$i") for i in 1:3]

3-element Vector{Symbol}:
 :a1
 :a2
 :a3

@combine(test,
    :total = sum(:w),
    @byrow :se = sum([(AsTable(a_cols) .- :w)^2])
)

and I get:

ERROR: ArgumentError: Column references must be either all the same type or a a combination of `Symbol`s and strings
Stacktrace:
 [1] make_source_concrete(x::Vector{Any})
   @ DataFramesMeta C:\Users\mthel\.julia\packages\DataFramesMeta\BkRtJ\src\parsing.jl:398
 [2] top-level scope
   @ C:\Users\mthel\.julia\packages\DataFramesMeta\BkRtJ\src\macros.jl:1944

The desired output would be:

3×5 DataFrame
 Row │ w      a1     a2     a3     v     
     │ Int64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────
   1 │    10     12      9     14     21
   2 │    13      9     10     10     34
   3 │     5      6      8     12     59

Maybe this:

transform(test, All() => ByRow((w,as...) -> sum((a-w)^2 for a ∈ as)) => :v)

Well, it doesn’t throw an error but it doesn’t produce the desired result unfortunately:

3×5 DataFrame
 Row │ w      a1     a2     a3     v     
     │ Int64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────
   1 │    10     12      9     14    142
   2 │    13      9     10     10    475
   3 │     5      6      8     12   2975

For me it works correctly:

julia> transform(test, All() => ByRow((w,as...) -> sum((a-w)^2 for a ∈ as)) => :v)
3×5 DataFrame
 Row │ w      a1     a2     a3     v
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │    10     12      9     14     21
   2 │    13      9     10     10     34
   3 │     5      6      8     12     59

(and it looks correct)

Maybe a more careful code would be e.g.

julia> transform(test, Cols(:w, r"a\d") => ByRow((w,as...) -> sum((a-w)^2 for a ∈ as)) => :v)
3×5 DataFrame
 Row │ w      a1     a2     a3     v
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │    10     12      9     14     21
   2 │    13      9     10     10     34
   3 │     5      6      8     12     59

but it is a detail.

Regarding DataFramesMeta.jl - I think it is not possible currently there, but @pdeffebach is working on it.

On the other hand DataFrameMacros.jl should support this I think - @jules should know best.

1 Like

The second version works great. Thanks to you both!

A simple way:

df.v = sum((df[:,a] - df.w).^2 for a in Symbol.("a", 1:3))

NB: renamed test, df

2 Likes

This is a good point. Here is a less elegant solution with DataFramesMeta

julia> @combine(test,
           :total = sum(:w),
           @byrow :se = begin 
               nt = AsTable([a_cols; :w])
               v_a = collect(nt[a_cols])
               w = nt.w
               sum( (v_a .- w) .^2 )
           end
       )

Another simple solution (good mostly for this example, StatsBase needed):

test.v = 3var(Matrix(test); mean=test.w, dims=2)

(I would also assume that the variance is actually needed and then the 3 in front isn’t needed).

Does All() have a guarantee on the order of variables that are selected? I was assuming it was 1:ncol(df), but it looks more complicated than that in the source.

yes All() produces the same order as names(test).

1 Like

@bkamins thanks for the tag, indeed this is something that DataFrameMacros can do via {{ }} syntax. This makes a tuple of the referenced columns so we can do:

using DataFrameMacros
using DataFrames

test = DataFrame(
    w = [10, 13, 5],
    a1 = [12, 9, 6],
    a2 = [9, 10, 8],
    a3 = [14, 10, 12]
)

@transform test :v = sum((:w .- {{r"a"}}) .^ 2)

3×5 DataFrame
 Row │ w      a1     a2     a3     v     
     │ Int64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────
   1 │    10     12      9     14     21
   2 │    13      9     10     10     34
   3 │     5      6      8     12     59

The contrast to { } specifiers is that those broadcast the whole expression, so if there were more ws one could do:

test2 = DataFrame(
    w1 = [10, 13, 5],
    w2 = [13, 15, 7],
    w3 = [9, 3, 10],
    a1 = [12, 9, 6],
    a2 = [9, 10, 8],
    a3 = [14, 10, 12]
)

@transform test2 "v_{}" = sum(({r"w"} .- {{r"a"}}) .^ 2)

3×9 DataFrame
 Row │ w1     w2     w3     a1     a2     a3     v_w1   v_w2   v_w3  
     │ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────────────────────────────────
   1 │    10     13      9     12      9     14     21     18     34
   2 │    13     15      3      9     10     10     34     86    134
   3 │     5      7     10      6      8     12     59     27     24
3 Likes