[ANN] PikaParser.jl -- small and fast parser library

It has been around half a year since we released the first version of PikaParser.jl, and we’re at 0.5.1 today successfully using it for several local projects*. So I thought it’s about time to announce it more formally.

What is it about?

Parsing of structured input from users and other programs is traditionally one of the most complicated, error-prone, fragile and otherwise annoying components of any software, and we witnessed too many bugs and security problems that stem from completely insufficient input validation and various forms of regex abuse. In some cases you may avoid the parsing problems by using JSON or XML, or perhaps some less structured and streaming formats. In other cases (especially if you’re facing an actual user) you want your input to be a nice, custom domain-specific language that is tailored precisely for your problem.

Parser libraries and generators such as GNU flex/bison, spirit, and Haskell Parsecs typically help with doing this quite easily, avoiding many common pitfalls. Some Julia packages exist too, such as PEGParser.jl and ParserCombinator.jl, but both are based off packrat parsers over parsing expression grammars that suffers from some mild usability problems – most notably when using left-recursive grammars. Recently a related metod called a pika parser appeared that is quite simple, solves many inconveniences, and comes with great goodies such as getting partial parses when input errors are present. (Btw., pikas are quite similar to pack rats, but apparently more cute.) So why not have a pure Julia implementation of that?

What does it do?

To use the library, you can follow the example from the README (see the same example below), or simply copypaste some code from the available tutorials and start playing with it.

In short, you first build a grammar in a very declarative and error-proof way:

import PikaParser as P

rules = Dict(
    :digits => P.some(:digit => P.satisfy(isdigit)), # parse some digit-ish chars
    :parens => P.seq( # to parse a parenthesis...
        P.token('('),  # ...open the parenthesis, ...
        :expr => P.first(:plusexpr, :minusexpr, :digits, :parens), #...parse something in it, ...
        P.token(')'),  # ...and close the parenthesis
    :plusexpr => P.seq(:expr, P.token('+'), :expr), # some extra math
    :minusexpr => P.seq(:expr, P.token('-'), :expr),

Then you “compile” the description to a fast, flat, integer-based grammar representation:

g = P.make_grammar([:expr],  P.flatten(rules, Char))

Now you can parse a string, getting a (very optimized and inhumane) bunch of possible parse trees:

p = P.parse(g, "123-((345)+567)")

In this bunch, you can find your match and traverse its corresponding parse tree, interpreting it in the process:

    P.find_match_at!(p, :expr, 1),
    fold = (m, p, subvals) ->
        m.rule == :digits ? parse(Int, m.view) :
        m.rule == :expr ? subvals[1] :
        m.rule == :parens ? subvals[2] :
        m.rule == :plusexpr ? subvals[1] + subvals[3] :
        m.rule == :minusexpr ? subvals[1] - subvals[3] : nothing,

…which gets you the parsed expression evaluated (in this case getting -789).

This example can be vastly expanded – there’s also support for:

  • fast input scanners and pre-lexing (which, in this case, saves tons of unnecessary allocations in the parse tree, and you can squeeze out some extra speed),
  • various context-sensitive tools specific for PEGs, such as the “followed-by” token match,
    (CompSci bonus: this allows the PEG grammars to describe even some context-sensitive languages, such as the canonical L=\{a^nb^nc^n | n\in\mathbb{N}\})
  • processing any kind of token sequence imaginable – if your sequence type supports integer-ish indexing with [] and the standard view, firstindex, lastindex, prevind and nextind, chances are that you can parse it with PikaParser.

Some hints of that are available in the documentation and tests.
Enjoy the parsing!

PS. I was wondering where to continue from here – this solved the problem for us, and I’m not really sure where to go next. I guess that some support for good and helpful parser error messages and error recovery would be cool. Having the declarative grammar clauses and their interpretation functions in a single structure (like with bison) might be useful too. Please feel free to share opinions.

PPS.: Coming next: Library for layout-aware pretty-printing.


Nice package! I just read about it in the Wikipedia page then Julia has a native solution of it.

1 Like