Field as Variable in Query.jl


#1

I’m a bit confused about why something (admittedly stupid) I was trying to do didn’t work.

Using Query.jl you can write something like:

x = @from i in df begin
    @where i.age>50
    @select {i.name, i.children}
    @collect DataFrame
end

Which goes through a DataFrame df, finds all rows where df[:age] is greater than 50 and returns df[:name] and df[:children]. I’d like to use these macros with variables for the data frame columns.

My idea was to do something like eval(parse("i.$variable")). That way I could create a function taking in variable, which then gets read into the string, which goes through parse and eval and (in my mind) should just act as i.variable in the @where macro.

Doing this with some random type works fine and returns the field variable of that type, if it exists. But in the @where macro it says that i is not defined. I assume this happens because of the scope of the @where macro, but I clearly don’t understand how macros work well enough to know why it doesn’t work.

Could somebody help me out a bit? For context, my final goal is to make simple way to go through a large-ish amount of data given a couple of selection criteria, so I’m trying to make a function which could take in a text input and parse parts of it as Query.jl query statements.

So effectively I’d like to make a function that takes in a string like "age>50, name children" and puts i.age>50 into @where and i.name, i.children into @select.


#2

You can try @where getfield(i, variable) > 50 (you could even put this in a function f(i)). I’m not sure how to do for the select statement though.


#3

I played around with this a bit and didn’t get it to work, in the end I ran into world age problems…

This is what I tried:

function foo(name)                          
  f = eval(parse("i->@NT($name = i.$name)"))
  @from i in df begin                       
  @select f(i)                              
  @collect                                  
  end                                       
end                                         

Not sure whether there is a solution to this…

@RobertR, could you post a bit more of the code that you tried? Just wondering whether that might be a more promising avenue.


#4

Thanks for the help, managed to come up with a solution:

function simple_query(q_where::Symbol, q_comp::Symbol, q_val::Any, q_select::Symbol, df::DataFrame)
     @from i in df begin
        @where eval(Expr(:call, q_comp, getfield(i, q_where), q_val))
        @select eval(getfield(i, q_select))
    end
end

For example, the inputs would be:

simple_query(:age, :(>), 50, :name, i)

My attempt to bodge together a way to pass these through as a string:

function string_query(string_in, df)
    query_where, query_select = split(string_in, ", ")

    comp_operators = ["==", ">", ">=", "<", "<="]
    comp = ""

    for c in comp_operators
        if contains(query_where, c)
            comp = c
        end
    end

    query_where_field, query_where_val = split(query_where, string(" ", comp, " "))

    if parse(query_where_val) isa Number
        q_val = parse(query_where_val)
    else
        q_val = replace(String(query_where_val), "\"", "")
    end

    return simple_query(Symbol(query_where_field), Symbol(comp), q_val, Symbol(query_select), df)
end

Into which you can just pass string_query("age > 50, name", df) and it should work.

There’s some weird bit with parsing non-numericals (which I assume are strings) there as my data uses string ID’s like "60001158002" which were being read as Int and causing problems. To query them I decided to make it so that string_query("obsid == \"60001158002\", Downloaded", df) reads "\"60001158002\"" as a SubString, then converts it to a String and strips the "'s so it ends up calling simple_query(:obsid, :(==), "60001158002", :Downloaded, df).

Not very pretty but it works.


#5

Interesting, and glad it works. It is probably pretty slow, though?


#6

Oh yeah, way slower but not noticeably so for a user. With the same query on the same data for both using Query normally takes ~5.7ms and using simple_query with symbols takes ~168ms. This should only be used as a more ‘user friendly’ way to find a handful of results though so it shouldn’t be a problem.