Is it possible to avoid eval in this code?

I would like to write a macro which subsets a collection of variables (with shared dimensions). My use case can be simplified to a dictionary of vectors where each vector has the same length, for example:

ds = Dict(:a => 1:10,:b => 2:11)

I would like to subset the variable ds based on the values associated with :a and :b (and not based on the indices). The macro syntax is the following:

ds2 = @select(ds,a < 5)

which corresponds to

Dict( (k,v[ds[:a] .< 5]) for (k,v) in ds )

One should also be able to substitute variables prefixed by $, as is the case with the @btime macro (from BenchmarkTools):

limit = 5
ds2 = @select(ds,a < $limit)

The code below is an implementation of this @select macro. I would like to know if it is possible to avoid the evil eval function and if it is possible to simplify the code (as there is a lot of quoting/escaping).

Thank you for sharing your ideas!

Below is a short implementation of the @select macro. In my case, the values of the dictionary are arrays (not just vectors) with named dimensions, but I think the simplified case (using vectors) should be sufficient to illustrate my question here.

# ds will always be a dictionary whose values are
# arrays of the same size
ds = Dict(
    :a => 1:10,
    :b => 2:11)

@assert all(sz -> sz == size(first(values(ds))),size.(values(ds)))

# helper function to recursively scan the expression

function scan_exp!(exp::Symbol,varnames,found)
    if exp in varnames
        push!(found,exp)
    end
    return found
end

function scan_exp!(exp::Expr,varnames,found)
    for arg in exp.args
        scan_exp!(arg,varnames,found)
    end
    return found
end

# neither Expr nor Symbol
scan_exp!(exp,varnames,found) = nothing

scan_exp(exp::Expr,varnames) = scan_exp!(exp::Expr,varnames,Symbol[])


function scan_coordinate_name(exp,coordinate_names)
    params = scan_exp(exp,coordinate_names)
    @assert length(params) == 1
    param = params[1]
    return param
end

macro select(ds,condition)
    exp2 = Meta.quot(condition)

    quote
        # escape ds so that it refers to the caller's variable
        # (and not to a global named ds in the macro's module)
        data = $(esc(ds))
        coord_names = keys(data)
        exp = $(esc(exp2))
        param = scan_coordinate_name(exp,coord_names)
        fun = eval(Expr(:->,param,exp))
        # avoid world age problem
        ind = Base.invokelatest(findall,fun,data[param])

        Dict( (k,v[ind]) for (k,v) in data )
    end
end


function test_fun(ds)
    ds2 = @select(ds,a < 5)
    @show ds2
end

function test_fun2(ds)
    limit = 5
    ds2 = @select(ds,a < $limit)
    @show ds2
end

# both functions should return
# Dict(:a => [1, 2, 3, 4], :b => [2, 3, 4, 5])
#
# which corresponds to
#
# Dict( (k,v[ds[:a] .< 5]) for (k,v) in ds )


test_fun(ds)
test_fun2(ds)

Interpolation of variables inside macro calls has to be dealt with manually by the macro. I think this is how BenchmarkTools does it:

Minor digression: in scan_coordinate_name, the length assertion followed by return params[1] is more easily written as return only(params).

Definitely you should not be using eval or invokelatest at all. A macro is a code rewriter. It should simply rewrite @select(ds,a < 5) into Dict( (k,v[ds[:a] .< 5]) for (k,v) in ds ) if the latter is what you want.

Perhaps useful to read the implementation of select here: GitHub - johnmyleswhite/Volcanito.jl: A backend agnostic for tabular data operations in Julia

Thank you all for your feedback.

@stevengj It is not clear to me how to do that, since at parse time I do not know that a is a key in ds.
For example, in @select(ds,a < π), a should be substituted by ds[:a] but not π. This is currently determined at runtime by the function scan_coordinate_name.

The current macro implementation works also for:

@select(ds,abs(a) > 3)
@select(ds,2 < b < 10)
@select(ds,!ismissing(b))

Here is some context where this macro is used:

Why not use a symbol syntax for the keys? For example, write @select(df, :a < pi)

The key to using macros is to think of them as purely syntax rewriters. If what you want to do cannot be inferred from the syntax alone, then you are not using macros appropriately.

+1 to using a Symbol for the keys, like in DataFramesMeta.jl. Without them, it's very hard to know which variables are contained in the object and which exist outside of it. Symbols get around this problem. Moreover, they make the expression easier to read.
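
The Symbol-marked syntax suggested above could be sketched roughly as follows. This is only an illustration and not code from DataFramesMeta.jl: the names replace_keys, select_rows and @select_sym are hypothetical, and the sketch assumes that every value of the dictionary has the same shape. Each quoted symbol such as :a is replaced by a placeholder argument of an anonymous function, which is then applied element-wise to the corresponding entries.

```julia
# Hypothetical helper: replace every quoted symbol such as :a by a fresh
# placeholder variable and record which key each placeholder stands for.
function replace_keys(ex, found::Dict{Symbol,Symbol})
    if ex isa QuoteNode && ex.value isa Symbol
        return get!(() -> gensym(), found, ex.value)
    elseif ex isa Expr
        return Expr(ex.head, (replace_keys(arg, found) for arg in ex.args)...)
    else
        return ex
    end
end

# Hypothetical runtime helper: evaluate the predicate element-wise over the
# referenced entries and subset every value of the dictionary.
function select_rows(ds, keynames, pred)
    mask = map(pred, (ds[k] for k in keynames)...)
    return Dict(k => v[mask] for (k, v) in ds)
end

macro select_sym(ds, condition)
    found = Dict{Symbol,Symbol}()
    body = replace_keys(condition, found)
    keynames = collect(keys(found))
    placeholders = [found[k] for k in keynames]
    # anonymous function with one argument per referenced key
    fun = Expr(:->, Expr(:tuple, placeholders...), body)
    keytuple = Expr(:tuple, QuoteNode.(keynames)...)
    # ds and the predicate are escaped so that they are evaluated in the
    # caller's scope; unmarked names such as π need no special treatment
    return :(select_rows($(esc(ds)), $keytuple, $(esc(fun))))
end

ds = Dict(:a => 1:10, :b => 2:11)
ds2 = @select_sym(ds, :a < 5)
# Dict(:a => [1, 2, 3, 4], :b => [2, 3, 4, 5])
```

Because only quoted symbols are rewritten, ordinary variables and constants (for example π or a local limit) can be used in the condition without any marker.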

Volcanito looks really nice! The syntax feels really natural. Volcanito.@where corresponds exactly to my proposed @select macro.

While this works as expected:

@where(df, a < 2)

I get unfortunately an error with the following:

@where(df, a < π)

It seems that all variable names need to be columns of the dataframe:

Error showing value of type Volcanito.Selection:
ERROR: ArgumentError: column name :π not found in the data frame; existing most similar names are: :a, :b and :c

But this works as expected

@where(df, a < $π)

Yes, that’s the other option: either you syntactically mark which variables are columns (e.g. with :a < π) or you syntactically mark which variables are not columns (e.g. with a < $π). The latter makes a lot of sense if you expect column names to be more common than other variables (which seems likely), and also $ interpolation is pretty idiomatic in Julia.

Maybe, it’s more natural to express this in reverse, as an array of dicts, or even namedtuples? Then your operation is just filter(x -> x[:a] < 5) or filter(x -> x.a < 5).
This layout is often more convenient to interoperate with arrays/collections/tables/etc ecosystems.
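
As a rough sketch of this alternative layout (assuming the same example dictionary as in the question; the variable names rows, selected and ds2 are chosen here only for illustration), the data can be converted to a vector of named tuples, filtered, and converted back if needed:

```julia
ds = Dict(:a => 1:10, :b => 2:11)

# one named tuple per element: (a = 1, b = 2), (a = 2, b = 3), ...
rows = [(a = a, b = b) for (a, b) in zip(ds[:a], ds[:b])]

# the selection is an ordinary filter with an ordinary anonymous function
selected = filter(x -> x.a < 5, rows)

# convert back to the dictionary-of-vectors layout if needed
ds2 = Dict(k => [getproperty(r, k) for r in selected] for k in keys(ds))
```

No macro is needed in this layout, since the predicate receives a whole element and accesses its fields by name.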

@Alexander-Barth: To make one point in Steven’s post explicit, a macro only has access to syntactic information. But the predicate “does this identifier refer to a column in my table?” can only be computed using runtime information about which columns exist in your table. As such, the column vs variable distinction cannot be made by a macro unless you indicate the answer using a syntactic construct. In DataFramesMeta.jl, we made the distinction using : to indicate a column. In Volcanito.jl, I made the distinction using $ to indicate a variable.
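
The point about syntactic information can be checked by inspecting the parsed expressions directly. The snippet below is only an illustration using Meta.parse; the variable names are arbitrary.

```julia
# Without any marker, a and π are both plain Symbols:
# nothing in the syntax says that a is a column and π is not.
plain = Meta.parse("a < π")
plain_args = plain.args            # the operator followed by the two operands

# Marking the column with : yields a QuoteNode, which a macro can detect.
marked_col = Meta.parse(":a < π")
col_is_quoted = marked_col.args[2] isa QuoteNode

# Marking the variable with $ yields an Expr with head :$,
# which a macro can detect as well.
marked_var = Meta.parse("a < \$π")
var_is_interpolated = marked_var.args[3] isa Expr && marked_var.args[3].head == :$
```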

Interesting idea! This seems to be the approach also taken by JuliaDB.

Thank you all very much for your insights here! All comments turned out to be very useful to me!

I implemented both approaches for this test case (referred to as :a < π and a < $π above, where a is a key in my dictionary). Both approaches were quite comparable. I will probably continue to use a < $π (marking with $ the variables which are not keys in the dictionary).

For reference, this is the eval-free approach:

# ds will always be a dictionary whose values are
# arrays of the same size
ds = Dict(
    :a => 1:10,
    :b => 2:11)

@assert all(sz -> sz == size(first(values(ds))),size.(values(ds)))

# helper function to recursively scan the expression

function scan_exp!(exp::Symbol,found)
    newsym = gensym()
    push!(found,exp => newsym)
    return newsym
end

function scan_exp!(exp::Expr,found)
    if exp.head == :$
        return exp.args[1]
    end

    if exp.head == :call
        # skip function name
        return Expr(exp.head,exp.args[1],scan_exp!.(exp.args[2:end],Ref(found))...)
    else
        return Expr(exp.head,scan_exp!.(exp.args[1:end],Ref(found))...)
    end
end

# neither Expr nor Symbol
scan_exp!(exp,found) = exp
function scan_exp(exp::Expr)
    found = Pair{Symbol,Symbol}[]
    exp = scan_exp!(exp,found)
    return found,exp
end

function select(ds, (param,cond))
    ind = findall(cond,ds[param])
    return Dict( (k,v[ind]) for (k,v) in ds )
end

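macro select(ds,condition)
    params,condition = scan_exp(condition)
    param,newsym = only(params)
    fun = Expr(:->,newsym,condition)
    quote
        # escape ds so that it refers to the caller's variable
        # (and not to a global named ds in the macro's module)
        select($(esc(ds)), $(Meta.quot(param)) => $(esc(fun)))
    end
end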

function test_fun(ds)
    ds2 = @select(ds,a < 5)
    @show ds2
end

function test_fun2(ds)
    limit = 5
    ds2 = @select(ds,a < $limit)
    @show ds2
end

function test_fun3(ds)
    a = 5
    ds2 = @select(ds,a < $a)
    @show ds2
end

# all functions should return
# Dict(:a => [1, 2, 3, 4], :b => [2, 3, 4, 5])
#
# which corresponds to
#
# Dict( (k,v[ds[:a] .< 5]) for (k,v) in ds )


test_fun(ds)
test_fun2(ds)
test_fun3(ds)
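
A possible refinement, sketched here under assumptions and not taken from the thread: the scan above creates a new placeholder for every symbol it visits, so a key that appears twice (a > 1 && a < 5) or the operators of a chained comparison (2 < b < 10) yield more than one entry and only fails. The variant below reuses one placeholder per name and leaves comparison operators untouched. The names substitute, select_by and @select_once are hypothetical.

```julia
# Replace bare symbols by one shared placeholder per name; `$x` is unwrapped
# so that x refers to a variable in the caller's scope.
function substitute(ex, found::Dict{Symbol,Symbol})
    if ex isa Symbol
        return get!(() -> gensym(), found, ex)
    elseif ex isa Expr && ex.head == :$
        return ex.args[1]
    elseif ex isa Expr && ex.head == :call
        # skip the function name
        return Expr(:call, ex.args[1],
                    (substitute(a, found) for a in ex.args[2:end])...)
    elseif ex isa Expr && ex.head == :comparison
        # odd positions are operands, even positions are operators
        args = [isodd(i) ? substitute(a, found) : a
                for (i, a) in enumerate(ex.args)]
        return Expr(:comparison, args...)
    elseif ex isa Expr
        return Expr(ex.head, (substitute(a, found) for a in ex.args)...)
    else
        return ex
    end
end

function select_by(ds, (param, cond))
    ind = findall(cond, ds[param])
    return Dict(k => v[ind] for (k, v) in ds)
end

macro select_once(ds, condition)
    found = Dict{Symbol,Symbol}()
    body = substitute(condition, found)
    # still restricted to a single key per condition
    param, placeholder = only(found)
    fun = Expr(:->, placeholder, body)
    return :(select_by($(esc(ds)), $(QuoteNode(param)) => $(esc(fun))))
end

ds = Dict(:a => 1:10, :b => 2:11)
ds3 = @select_once(ds, 2 < b < 10)
ds4 = @select_once(ds, a > 1 && a < 5)
```

Dotted calls such as abs.(a) are not handled by this sketch and would need an additional case.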