Hi all,
I have released a new package, Automa.jl, a set of tools for compiling regular expressions written in a Julia DSL into optimized Julia code. Here is a short example, taken from example/numbers.jl, that tokenizes a string into numeric literals:
import Automa
import Automa.RegExp: @re_str
const re = Automa.RegExp
# Describe patterns in regular expression.
oct = re"0o[0-7]+"
dec = re"[-+]?[0-9]+"
hex = re"0x[0-9A-Fa-f]+"
prefloat = re"[-+]?([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)"
float = prefloat | re.cat(prefloat | re"[-+]?[0-9]+", re"[eE][-+]?[0-9]+")
number = oct | dec | hex | float
numbers = re.cat(re.opt(number), re.rep(re" +" * number), re" *")
# Register action names to regular expressions.
number.actions[:enter] = [:mark]
oct.actions[:exit] = [:oct]
dec.actions[:exit] = [:dec]
hex.actions[:exit] = [:hex]
float.actions[:exit] = [:float]
# Compile a finite-state machine.
machine = Automa.compile(numbers)
# This generates a PNG file to visualize the state machine (requires Graphviz).
# write("numbers.dot", Automa.dfa2dot(machine.dfa))
# run(`dot -Tpng -o numbers.png numbers.dot`)
# Bind an action code for each action name.
actions = Dict(
    :mark  => :(mark = p),
    :oct   => :(emit(:oct)),
    :dec   => :(emit(:dec)),
    :hex   => :(emit(:hex)),
    :float => :(emit(:float)),
)
# Generate a tokenizing function from the machine.
@eval function tokenize(data::String)
    # Initialize variables you use in the action code.
    tokens = Tuple{Symbol,String}[]
    mark = 0
    emit(kind) = push!(tokens, (kind, data[mark:p-1]))
    # Initialize variables used by the state machine.
    $(Automa.generate_init_code(machine))
    p_end = p_eof = endof(data)
    # This is the main loop to iterate over the input data.
    $(Automa.generate_exec_code(machine, actions=actions))
    # Return found tokens and the final status.
    return tokens, cs == 0 ? :ok : cs < 0 ? :error : :incomplete
end
tokens, status = tokenize("1 0x0123BEEF 0o754 3.14 -1e4 +6.022045e23")
The result looks like this:
julia> tokens, status = tokenize("1 0x0123BEEF 0o754 3.14 -1e4 +6.022045e23");
julia> tokens
6-element Array{Tuple{Symbol,String},1}:
(:dec,"1")
(:hex,"0x0123BEEF")
(:oct,"0o754")
(:float,"3.14")
(:float,"-1e4")
(:float,"+6.022045e23")
julia> status
:ok
The example above should be mostly self-explanatory, but let me explain a few of its features.
The motivation for this package is that we need simple and fast parser generators for Julia. In Automa.jl, we can describe a pattern (or a grammar) using a composable DSL and insert actions that are executed while parsing data. This is especially important in the BioJulia project, where we are developing many text parsers to load the various file formats commonly used in biology. These formats are often complicated and the files are large, so we decided to develop a compiler that generates fast parsers without the hassle. Of course, since the description language is regular expressions, we cannot generate parsers for languages with nested structures, but most file formats used in bioinformatics are flat.
The compiler works as follows. First, it translates a set of regular expressions into a finite-state machine. Then the machine is optimized to minimize the number of states. Finally, a code generator produces a Julia expression using metaprogramming techniques. Regular expressions can be associated with action names, which are replaced with Julia expressions when the final code is generated. This means you can run arbitrary code during pattern matching, which is the main difference from ordinary regular expression engines.
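To make the idea of actions attached to a state machine more concrete, here is a minimal hand-written sketch. It is not Automa.jl output and does not use the Automa.jl API; the function name `tokenize_digits` and the two-state layout are my own illustration, and it assumes a current Julia 1.x release. It recognizes space-separated decimal integers and runs an "enter" action (remember the start position) and an "exit" action (emit the token) on the corresponding transitions:

```julia
# Hand-written sketch (not generated by Automa.jl): a two-state machine for
# space-separated decimal integers such as "12 345 6".
function tokenize_digits(data::String)
    bytes = codeunits(data)      # scan raw bytes, one transition per byte
    tokens = String[]
    mark = 0                     # start position of the current token
    state = 1                    # 1 = between tokens, 2 = inside a run of digits
    for p in 1:length(bytes)
        b = bytes[p]
        is_digit = UInt8('0') <= b <= UInt8('9')
        if state == 1
            if is_digit
                mark = p         # "enter" action: remember where the token starts
                state = 2
            elseif b != UInt8(' ')
                return tokens, :error
            end
        else
            if b == UInt8(' ')
                push!(tokens, data[mark:p-1])   # "exit" action: emit the token
                state = 1
            elseif !is_digit
                return tokens, :error
            end
        end
    end
    # End of input: run the pending "exit" action if we stopped inside a token.
    state == 2 && push!(tokens, data[mark:ncodeunits(data)])
    return tokens, :ok
end

tokens, status = tokenize_digits("12 345 6")
```

In Automa.jl the states and transitions are derived from the regular expressions, so only the action bodies need to be written by hand.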
The runtime performance is also strong: a FASTA file format parser (available here) generated using Automa.jl is as fast as common hand-written parsers in C. This is because Automa.jl generates fast goto-based code that simulates state transitions.
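To give a feel for what goto-based code looks like, here is a small hand-written sketch using Julia's `@label` and `@goto` macros. It is only an illustration of the style, not the code Automa.jl actually emits; the function name `is_decimal` and the label names are hypothetical, and it assumes a current Julia 1.x release. Each label corresponds to one state of a machine for the pattern `[-+]?[0-9]+`, and each `@goto` is a state transition:

```julia
# Hand-written illustration of goto-style matching (not Automa.jl output).
# Recognizes an optionally signed decimal integer: [-+]?[0-9]+
function is_decimal(data::String)
    bytes = codeunits(data)
    n = length(bytes)
    p = 1

    @label state_start           # expect a sign or a digit
    p > n && return false
    b = bytes[p]
    p += 1
    if b == UInt8('+') || b == UInt8('-')
        @goto state_signed
    end
    if UInt8('0') <= b <= UInt8('9')
        @goto state_digits
    end
    return false

    @label state_signed          # after a sign: at least one digit is required
    p > n && return false
    b = bytes[p]
    p += 1
    if UInt8('0') <= b <= UInt8('9')
        @goto state_digits
    end
    return false

    @label state_digits          # accepting state: keep consuming digits
    p > n && return true
    b = bytes[p]
    p += 1
    if UInt8('0') <= b <= UInt8('9')
        @goto state_digits
    end
    return false
end

is_decimal("-42")
```

Because the current state is encoded in the position within the code rather than in a variable, there is no per-byte dispatch on a state number, which is one reason this style tends to be fast.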
In BioJulia, we currently use Ragel to generate such parsers. However, last year it dropped support for languages other than C/C++ and assembly. That's why I decided to develop a new package to replace it.
Automa.jl is still very young, so I'd appreciate feedback from the community to help improve the quality of the package.
Thank you.