JuliaDBMeta/JuliaDB - How to select columns dynamically/programatically

I’ve been beating my head against the wall for hours on end trying to figure out how to avoid hard-coding column names in the below example (specifically, in the array comprehension wgts = [sum(cols(Symbol("PWGTP$i"))) for i in 1:80]).

function test(state::Int64, occ::Int64)
       @applychunked tbl begin
            @where !ismissing(:OCCP) &&
            :ST == state &&
            (:ESR == 1 || :ESR == 2) &&
            :OCCP == occ
            @groupby _ :PUMA { total = sum(:PWGTP), wgts = [sum(cols(Symbol("PWGTP$i"))) for i in 1:80] }
        end
    end

I get an error when doing this: UndefVarError: i not defined. However, it doesn’t appear that you can select columns by their index (at least not in this context) so I don’t know how I can perform operations such as the above one without having to write an insanely long line of code with all 80 variables.

In the data, there are columns :PWGTP, :PWGTP1, ... ,:PWGTP80 and I need to be able to sum the values in columns :PWGTP1, ... , :PWGTP80, kind of like I’ve done with the total = sum(:PWGTP) piece, just without having to write sum() for all 80 columns…

I’m hoping someone familiar with JuliaDB/JuliaDBMeta can help.

Thanks!

I would first sum up the columns:

syms = [Symbol("PWGTP$(i)") for i in 1:80]
tbl = @transform tbl {PWGTP_sum = sum([getfield(_, s) for s in syms])}

and then sum up over rows in your @groupby. @piever certainly knows some syntax sugar for my solution above!

One trick is to use _ inside the macro to access the whole table (or the whole row in a “row-based” macro). For example

function test(state::Int64, occ::Int64)
       @applychunked tbl begin
            @where !ismissing(:OCCP) &&
            :ST == state &&
            (:ESR == 1 || :ESR == 2) &&
            :OCCP == occ
            @groupby :PUMA { total = sum(:PWGTP), wgts = [sum(column(_, Symbol("PWGTP$i"))) for i in 1:80] }
        end
    end

cols tries to somehow also discover the column identities statically for extra optimizations so it doesn’t work inside a for loop, but at least the error message could be better. Would you mind opening an issue about this?

In this particular case you can also use JuliaDB special selectors (in this case a Regex) to get all columns that start with PWGTP (special selectors are documented at https://juliacomputing.github.io/JuliaDB.jl/latest/basics/#Selectors-1)

@groupby :PUMA { total = sum(:PWGTP), wgts = [sum(col) for col in columns(_, r"^PWGTP")] }

@MaximilianJHuber @piever Thank you so much! I opened the issue as requested. :+1: :smiley:

I think it would also be good to add this column() function to the (JuliaDBMeta?) documentation as I don’t see it anywhere. I found cols() obviously and I see that JuliaDB has a columns() function but I was unaware of column().