Parsing selection syntax

My question is not exclusively about Julia, but Julia is one language in which I see a possible solution.

What I want is to implement a natural syntax to select items from a list (actually atoms from a molecular system), using something like:

selection = select("index < 1000 and not water and (name C or name N)",atoms)

where atoms is a vector of elements of type atom, which might be something like:

struct Atom
    name 
    index
    element
    coordinates
end

and, of course, we could define functions like iswater(atom) to deal with with the not water condition, for example.

I thought that I could simply parse such a statement “index < …” converting it to a Julia-code, with
something like

function convert(selection)
  expr = ""
  for keyword in split(selection)
    if keyword == "index"
      expr = expr*" atom.index "
    elseif keyword == "and"
      expr = expr*" && " 
    elseif keyword == "not"
      expr = expr*" ! "
    elseif keyword == "water"
      expr = expr*" iswater(x) "
    else
      expr = expr*" $keyword "
    end
  end
  return expr
end

such that

convert("index < 100 and not water")

returns

" atom.index  <  100  &&  !  iswater(x) "

From that, I could use something like:

function select(atoms,selection)
  expr = convert(selection)
  selection = Vector{Atoms}(undef,0)
  for atom in atoms
    if eval(Meta.parse(expr))
      append(selection,atom)
    end
  end
  return selection
end
# Obs: actually this function does not work, because the 
# interpolation of the atom variable on "eval" fails, but
# I need yet to understand that. But probably the idea of what
# I aim to obtain is clear.

So, my general question is: is this strategy (of generating meta-codes) a reasonable choice to tackle this problem? Writing an actual interpreter of the user-provided syntax seems too complicated.

The more specific question, if the approach is reasonable, is how to evaluate the generated expression for each element of an array, as it seems to work for a single element, but when inside a loop or function, the variable is no longer recognized:

julia> expr = :(x < 1);
julia> y  = [ 0, 1 ];
julia> for x in y
         eval(expr)
       end
ERROR: UndefVarError: x not defined


(a third possibility is that someone knows of a package that does that already, I could not find one).

Metaprogramming is not magic. You should not try to use eval for this. It’ll be strictly slower than other methods.

In fact, there’s nothing magical. You always need to do the same amount of work and the only thing julia may help here is syntax.

The first question for you is that what is your input. Does it have to be a random string that is only known at runtime, for example if you are prompting the user for input from a REPL? If yes then you should just go with however you do it in other language. You may try to encode the parsed result in a type to speed up repeated execution but that’s the extend you can do.

If the string is not free-formed, but rather is always constant at construction or with limited templating syntax on top, then you should just make the user construct the code, as function callback or other types encoding the result, directly, and this is exactly what you can do in most other languages as well. The only difference julia/metaprogramming can make is to write a macro/string macro so that you can do your own parsing to generate such an object.

1 Like

this gives off pandas.query vibe… (which I dislike, whenever I need to re-visit or re-use them it’s always a mess).
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html

Maybe helpful: https://thautwarm.github.io/MLStyle.jl/v0.3.1/tutorials/query-lang/

1 Like

The input is variable, of course, it will be an user input, which will be parsed from an input file.

The user is not expected to know anything about programming, he/she might not even notice that the package is written in Julia. What I want is to reproduce the same selection syntax of popular packages for molecular simulations.

The selection function does not need to very fast, that is far from being the computationally expensive part of the code.

What I think that meta-programing can give in this case is not speed, it just a very simple way to translate the syntax of a selection into a code to be evaluated in Julia.

So, to be clearer:

The function convert above, translates the syntax index < 100 and not water, which is familiar to the user, to the corresponding Julia code for the same conditional: atom.index < 100 && ! iswater(atom).

In principle, I can write a file with the following function, given that result:

f -> atom -> atom.index  <  100  &&  !  iswater(atom)

My code could continue simply including that file, and aftewards the selection of the atoms could be done with

selection = filter(f,atoms)

The meta-programing part of the story is to avoid writing a file to disk and having to include that file afterwards.

1 Like

Maybe I can ask a very simple question:

Given the string

s = "x > 1".

Is there a simple way to use this string to define a function, such as:

f = x -> x > 1

I tried different forms of interpolation, but I am a quite lost in the documentation of the meta-programing part. Maybe that is not even what meta-programing is thought for, and a simpler way to do that exists, of course not involving writing and including files from disk.

EDIT: This works:

s = "x > 1"
f = eval(Meta.parse("x -> "*s))
f(2)
true

Such that it is possible to use the code above with:

selection_expression = convert("index < 100 and not water")

if it returns, by adding atom -> as the first string and Meta.parse(expr) at the end, the expr:

:(atom -> atom.index < 100 && ! iswater(atom))

With that it is possible to define, in the REPL, the selection function:

f = eval(expr)

and use it in

filter(f,atoms)

I found some threads warning about the use of eval with expressions provided by users, which can result very unexpected results, which is true. So I will keep searching for a better solution.

The final result is the one below, used in a simple example where I use the selection to compute the center of the coordinates of a subset of atoms of an atom list.

Actually I like the result. It is fast wherever that matters, which is in the function that actually does the computations.

If anyone thinks that I am doing something highly non-recommendable, I will be glad to get the advice.


# Structure with its properties
struct Atom
  index
  element
  coor
end

# One specific function that might be used for selection
function ismetal(atom :: Atom)
  if atom.element == "Fe"
    true
  else
    false
  end
end

# Function that converts the natural expression to the Julia conditional
function select_expr(selection :: String)
  expr = "atom -> "
  for keyword in split(selection)
    if keyword == "index"
      expr = expr*" atom.index "
    elseif keyword == "and"
      expr = expr*" && "
    elseif keyword == "not"
      expr = expr*" ! "
    elseif keyword == "metal"
      expr = expr*" ismetal(atom) "
    else
      expr = expr*" $keyword "
    end
  end
  return Meta.parse(expr)
end

# And this converts the above to the function
select(string) = eval(select_expr(string))

# Example of use: a function that computes the center of coordinates
# of the selected atoms
function center(atoms,selection_function)
  center = zeros(3)
  nsel = 0
  for atom in atoms
    if selection_function(atom)
      center[1] = center[1] + atom.coor[1]
      center[2] = center[2] + atom.coor[2]
      center[3] = center[3] + atom.coor[3]
      nsel = nsel + 1
    end
  end
  @. center = center / nsel
  return center
end

# Let us define the atoms:
atoms = Vector{Atom}(undef,100)
for i in 1:100
  atoms[i] = Atom(i,rand(["Fe","C"]),rand(3))
end

# Choose some of them with the natural syntax (actually defines
# the function)
sel = select("index < 80 and not metal")

# What is passed to the function that computes the center is the
# function, thus this will be fast
c = center(atoms,sel)

println(c)

I reinforce that, in my case, I will not change the selection function many times, and this is not important for the speed of the code in general. I only need a parser for a natural syntax, not a very efficient code to define selections. That is why this solution is acceptable for me. I still have no idea how I would have to manage this without meta-programing.

No that’s not what metaprogramming is for. This will one of the worst way to do it.

That is just as bad and will give you exactly the error you get otherwise. Writing to a file and include is nothinng more than include_string or parse and eval that code.

That’s parse and eval by definition.

1 Like

Those are all good reasons that you should not use this. The single good thing about creating a function like this is if you actually care about how fast it evaluate many times later. This is also not really any more metaprograming than what you’ll do in any other language. The better way to do this, same as what all other language would use, is to simply do the thing instead emitting a string to do anything. You are generating a straight line expression, meaning that you basically only need to keep track of a few values as you evaluate the string.

1 Like

I think the approach you are taking will be difficult to generalize to a more complicated query grammar. Here’s an approach that avoids metaprogramming.

The idea is to build up an Abstract Syntax Tree (AST). But I don’t mean a Julia AST, I mean an AST for your query language. The AST for your query language can be represented by nested tuples.

Atom code

struct Atom
	index::Int
	iswater::Bool
	isiron::Bool
end

iswater(a::Atom) = a.iswater
isiron(a::Atom) = a.isiron

a = Atom(50, true, false)
b = Atom(200, false, false)
c = Atom(200, false, true)

Example query to parse

s = "index < 100 or not water and not iron"

Parsing function

The following function parse_query parses a query string into a nested tuple that represents the AST of your query. Each tuple represents an S-expression where the first argument of the tuple is a function and the remaining arguments of the tuple are the arguments of that function.

A couple things to note:

  • parse_query is a recursive function, because the query could have arbitrarily nested ands, ors, and nots.
  • The if-else block is ordered from lowest operator precedence to highest operator precedence, with "or" being the lowest precedence and function application (like iswater) the highest precedence.
function parse_query(s)
    if occursin("or", s)
        (|, parse_query.(split(s, "or"))...)
    elseif occursin("and", s)
        (&, parse_query.(split(s, "and"))...)
    elseif occursin("not", s)
        rest = match(r".*not(.*)", s)[1]
        (!, parse_query(rest))
    elseif occursin("index <", s)
        k = parse(Int, match(r"index < ([0-9]*)", s)[1])
        a -> a.index < k
    elseif occursin("water", s)
        iswater
    elseif occursin("iron", s)
        isiron
    else
        error("Parsing error!")
    end
end

Let’s run this on the query string and see what we get:

julia> q = parse_query(s)
(|, var"#5#6"{Int64}(100), (&, (!, iswater), (!, isiron)))

So we see that we get our AST as a nested tuple! Now we need a way to apply the AST to an atom.

Apply parsed query to atom

function apply_query(q, a)
	if !(q isa Tuple)
		q(a)
	else
		f, args = Iterators.peel(q)
		f(apply_query.(args, Ref(a))...)
	end
end

This function recursively calls the first argument of each tuple with the remaining arguments of the tuple. When apply_query gets down to a point in the AST where q is not a tuple, it assumes that q is a function (e.g., either iswater, isiron, or a -> a.index < k) and it calls that function on the atom a.

Now let’s apply our parsed query to an atom:

julia> apply_query(q, a)
true

julia> apply_query(q, b)
true

julia> apply_query(q, c)
false

It works, and all without using any metaprogramming!

EDIT:
I had my operator precedence wrong in the initial post, so I’ve fixed that.

11 Likes

Thank you very much. I have to study to understand exactly how that works, and particularly how to deal with parenthesis, but it is clear what I have to do.

2 Likes

If acceptable with the users, S-expressions may be easiest to parse.

Just to mention that I found out now that the “Chemfiles.jl” package has a pretty powerful selection syntax, which is quite similar to what I want:

http://chemfiles.org/Chemfiles.jl/latest/tutorials/#Using-selections

Thus, I almost certainly will go with that.

2 Likes