ANN: Automa.jl - a package to compile regular expressions to Julia

Hi all,

I released a new package, Automa.jl, which is a set of tools to compile regular expressions written in Julia DSL into optimized Julia code. Here is a short example to tokenize a string into numerical literals from example/numbers.jl:

import Automa
import Automa.RegExp: @re_str
const re = Automa.RegExp

# Describe patterns in regular expression.
oct      = re"0o[0-7]+"
dec      = re"[-+]?[0-9]+"
hex      = re"0x[0-9A-Fa-f]+"
prefloat = re"[-+]?([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)"
float    = prefloat | re.cat(prefloat | re"[-+]?[0-9]+", re"[eE][-+]?[0-9]+")
number   = oct | dec | hex | float
numbers  = re.cat(re.opt(number), re.rep(re" +" * number), re" *")

# Register action names to regular expressions.
number.actions[:enter] = [:mark]
oct.actions[:exit]     = [:oct]
dec.actions[:exit]     = [:dec]
hex.actions[:exit]     = [:hex]
float.actions[:exit]   = [:float]

# Compile a finite-state machine.
machine = Automa.compile(numbers)

# This generates a SVG file to visualize the state machine.
# write("numbers.dot", Automa.dfa2dot(machine.dfa))
# run(`dot -Tpng -o numbers.png numbers.dot`)

# Bind an action code for each action name.
actions = Dict(
    :mark  => :(mark = p),
    :oct   => :(emit(:oct)),
    :dec   => :(emit(:dec)),
    :hex   => :(emit(:hex)),
    :float => :(emit(:float)),
)

# Generate a tokenizing function from the machine.
@eval function tokenize(data::String)
    # Initialize variables you use in the action code.
    tokens = Tuple{Symbol,String}[]
    mark = 0
    emit(kind) = push!(tokens, (kind, data[mark:p-1]))

    # Initialize variables used by the state machine.
    $(Automa.generate_init_code(machine))
    p_end = p_eof = endof(data)

    # This is the main loop to iterate over the input data.
    $(Automa.generate_exec_code(machine, actions=actions))

    # Return found tokens and the final status.
    return tokens, cs == 0 ? :ok : cs < 0 ? :error : :incomplete
end

tokens, status = tokenize("1 0x0123BEEF 0o754 3.14 -1e4 +6.022045e23")

The result looks like this:

julia> tokens, status = tokenize("1 0x0123BEEF 0o754 3.14 -1e4 +6.022045e23");

julia> tokens
6-element Array{Tuple{Symbol,String},1}:
 (:dec,"1")
 (:hex,"0x0123BEEF")
 (:oct,"0o754")
 (:float,"3.14")
 (:float,"-1e4")
 (:float,"+6.022045e23")

julia> status
:ok

The example above would be almost self-explanatory, but let me explain a little bit on its features.

The motivation to have made this package is we need simple and fast parser generators for Julia. In Automa.jl, we can describe a pattern (or a grammar) using the composable DSL and insert actions that will be executed while parsing data. This is especially important in the BioJulia project because we are developing many text parsers to load files of various file formats commonly used in biology. These file formats are often complicated and files are large. So, we decided to develop a compiler that generates fast parsers without hassles. Of course, since the description language is regular expression, we cannot generate parsers for languages with nested structures. But most file formats used in bioinformatics are flat.

The compiler works as follows. First, it translates a set of regular expressions into a finite state machine. Then the machine is optimized to minimize the number of states. Finally, a code generator generates a Julia expression using metaprogramming techniques. Regular expressions can be associated with actions names which will be substituted with some Julia expressions when generating a final Julia code. That means, you can run arbitrary code while executing pattern matching. This is the main difference from other usual regular expressions.

The runtime performance is also amazing. For example, a FASTA file format parser (available here) generated using Automa.jl is as fast as other common hand-written parsers in C. This is because Automa.jl generates fast goto-based code which simulates state transitions.

In BioJulia, we currently use Ragel to generate such parsers. However, it stopped supporting languages except C/C++ and assembly last year. That’s why I decided to develop a new package to replace it with.

Automa.jl is very young. So, I’d like to get nice feedbacks from the community to improve the quality of the package.

Thank you.

13 Likes

@bicycle1885, this look really nice!

So, we decided to develop a compiler that generates fast parsers without hassles. Of course, since the description language is regular expression, we cannot generate parsers for languages with nested structures. But most file formats used in bioinformatics are flat.

And many nested formats like Newick are horrible and deserve to die and never be used, despite their inexplicable popularity :stuck_out_tongue_winking_eye:

Wow, that is excellent! I’m glad the loss of Ragel could be the irritating grain of sand which led to this pearl. This definitely fills a hole in the Julia package ecosystem.

It’s a bit of overkill for FASTA parsing though, isn’t it? A quick FASTA parser just uses the line based nature of the format, and a yield to provide results a record at a time, just a few lines in Python.

I have several slides I dedicate to this subject in my BioJulia talks. The short answer is no, because for some reason in the past Biologists thought - hey let’s store large and important data in text files and manipulate those, and nearly every bioinformatics project at some point implements it’s own FASTA/FASTQ parser because of logic like that you stated: It’s just a few lines right? What could go wrong?

But actually, a lot can go wrong, first people typically don’t bother validating the formats of the files, so we get sloppy adhereance to standards. Then the behaviour of these hand done parsers in various software projects is often slightly different, one might pay attention to metadata on a FASTA name line, others just read it in as part of the name, or ignore it. This leads us to a situation where errors and inconsistencies can appear in our pipelines.

I usually then demonstrate tools where loading in an ASCII picture does not crash or stop the program but gives some garbled result, and I ask, if it’s not complaining about that and just letting it through, what errors and format violations (that are much less obvious and egregious) in your real data is it letting through? And what does that mean for your results?

Generation of text format parsers using formal grammar specifications solves those concerns.

5 Likes

Very nice.

I like the codegen idea.

A few random thoughts:

There is also ParserCombinators.jl which also uses a julia DSL to define a parser, in there case the more powerful Context Free Grammar.
Which some people might be interested in if regular expressions are not enough.
It does not feature any minimization, so implementing your a parser for a regular language in Automa.jl will almost certainly be faster.

I’ve been playing around with shelling out to finite state transducers OpenFST and HFST.
I have a blog post which talked about calling OpenFST. Though HFST is i think more interesting as it makes linguistic stuff easier, rather than the lower level cellular automata focus of OpenFST.
These are super powerful frameworks for this kind of thing
and julia does make calling them easy enough.
It is a bit annoying to actually apply them to a string.
OpenFST makes you first convert your input string into a finite state acceptor (cos it’s low level like that).
HFST is much better, but you still need to do some parsing on the output it returns – which kinda defeat the purpose.

I might give this a shot for my current needs.
The stuff I was FSTing lately, I realized can be expressed as 36 exact replacement rules. So in the interests of not wasting time I just coded that as a series of replace(::String,::String) in a loop.
Though it is not (quiet) deterministic in reverse, so I may end up going back to a weighted FST.

Is there any intent for Automa.jl to go towards a more formal direction?
It seems to be equivalent to a FST.
Which would mean it is open to a whole range of interesting and useful operations.
Like running its inverse, and composing them.
Though I haven’t looked closely enough at what you call “actions” to be sure they are similar enough to an FST’s output symbols.

Is there intent to be supporting one of the pseudo-standard formats for finite state machines like the AT&T FSM format either as input to descibe the rules; or as output – to allow the rules to be manipulated using other tools?

I have no experience of using any FST libraries. So, I may misunderstand what you want but unfortunately I’m not willing to extend the library for more general FSTs at the moment.

Basically, I designed Automa.jl to generate fast file parsers as easy as possible. It is intended to be used in Bio.jl and hence it cannot process data except byte sequences (strings are regarded as a byte sequence in some encoding). However, if you’re handling bytes and you’d like to generate some output for an input, Automa.jl would be useful because it can run “actions” within state transitions. Parsers in Bio.jl are a kind of transducers I think since they take a text data and generate Julia objects. Actions can be any Julia code you want to run. So, you can capture a part of the input and convert it to arbitrary objects using Julia functions as shown in the README page of Automa.jl.

1 Like