How can I make this DataFramesMeta operation work?

I have a DataFrame that has a PWGTP column, and then 80 additional columns named PWGTP1PWGTP80. I’m trying to do the following:

total = @chain vt18 begin
	@rsubset(
		:AGEP in 21:64,
		:DIS == 1
	)
	@combine(
		:total = sum(:PWGTP),
        :variance = (4/80) * sum([(sum(Symbol("PWGTP$i")) - sum(:PWGTP))^2 for i in 1:80])
	)
end

This produces the following error:

ERROR: MethodError: no method matching iterate(::Symbol)

I found some old code where I implemented this with JuliaDBMeta like so:

variance = (4/80) * sum([(sum(column(_, Symbol("PWGTP$n"))) - sum(:PWGTP))^2 for n in 1:80]),

I’m struggling to find the DataFramesMeta solution…

Can you use the names function to get a vector of the column names that you want to use, then iterate over those?

1 Like

This appears to work:

w_cols = vcat("PWGTP", ["PWGTP$i" for i in 1:80])

@chain vt18 begin
	@rsubset(
		:AGEP in 21:64,
		:DIS == 1
	)
	@combine(
		:total = sum(:PWGTP),
		:variance = (4/80) * sum([
			(sum(AsTable(w_cols)[i]) - sum(AsTable(w_cols)[1]))^2 for i in 2:80
		])
	)
end

…I still need to double-check the results though to make sure it’s doing what I think it’s doing…

  1. For interpolation, you want to use $, not Symbol. See documentation here.
  2. In general, this operation isn’t really possible in DataFramesMeta. The issue is that $"PWGTP$i" can only refer to one column. and i has to be defined before the macro is expanded.
julia> df = DataFrame(a1 = rand(10), a2 = rand(10), b = rand(10));

julia> @transform df :c = begin
           i = 1
           $"a$i"
       end
ERROR: UndefVarError: `i` not defined

Here is one solution that appears to work

julia> df = DataFrame(a1 = rand(10), a2 = rand(10), b = rand(10));

julia> function getsum(df)
           (newsum = sum(sum(df[!, "a$i"]) - sum(df[!, "b"]) for i in 1:2),)
       end;

julia> @chain df begin
           combine(getsum, _)
       end
1×1 DataFrame
 Row │ newsum  
     │ Float64 
─────┼─────────
   1 │ 2.46766

And finally (as you have already written, AsTable works well here too.

julia> function getsum(nt::NamedTuple)
           sum(sum(nt[Symbol("a$i")]) - sum(nt[Symbol("b")]) for i in 1:2)
       end;

julia> cols = [:a1, :a2, :b];

julia> @chain df begin
           @combine :newsum = getsum(AsTable(cols))
       end
1×1 DataFrame
 Row │ newsum  
     │ Float64 
─────┼─────────
   1 │ 2.46766

Note: There is a bit of a compile cost when using AsTable because it creates a NamedTuple (and thus specializes on the length and names of inputs). But 80 isn’t that many columns so it should be fine.

Finally, note that there is an “escape hatch” to work with the whole data frame inside combine. In transform, if you give just a function, on its own, as an argument, the whole DataFrame (or SubDataFrame, in the case of a grouped operation) is passed. So you can combine both solutions so far

julia> @chain df begin
           @combine begin
               $(getsum)
               :newsum2 = getsum(AsTable(cols))
           end
       end
1×2 DataFrame
 Row │ newsum   newsum2 
     │ Float64  Float64 
─────┼──────────────────
   1 │ 2.46766  2.46766
1 Like