[ANN] PikaParser.jl -- small and fast parser library

MirekKratochvil · January 28, 2023, 4:42pm

It has been around half a year since we released the first version of PikaParser.jl, and we’re at 0.5.1 today successfully using it for several local projects*. So I thought it’s about time to announce it more formally.

What is it about?

Parsing of structured input from users and other programs is traditionally one of the most complicated, error-prone, fragile and otherwise annoying components of any software, and we witnessed too many bugs and security problems that stem from completely insufficient input validation and various forms of regex abuse. In some cases you may avoid the parsing problems by using JSON or XML, or perhaps some less structured and streaming formats. In other cases (especially if you’re facing an actual user) you want your input to be a nice, custom domain-specific language that is tailored precisely for your problem.

Parser libraries and generators such as GNU flex/bison, spirit, and Haskell Parsecs typically help with doing this quite easily, avoiding many common pitfalls. Some Julia packages exist too, such as PEGParser.jl and ParserCombinator.jl, but both are based off packrat parsers over parsing expression grammars that suffers from some mild usability problems – most notably when using left-recursive grammars. Recently a related metod called a pika parser appeared that is quite simple, solves many inconveniences, and comes with great goodies such as getting partial parses when input errors are present. (Btw., pikas are quite similar to pack rats, but apparently more cute.) So why not have a pure Julia implementation of that?

What does it do?

To use the library, you can follow the example from the README (see the same example below), or simply copypaste some code from the available tutorials and start playing with it.

In short, you first build a grammar in a very declarative and error-proof way:

import PikaParser as P

rules = Dict(
    :digits => P.some(:digit => P.satisfy(isdigit)), # parse some digit-ish chars
    :parens => P.seq( # to parse a parenthesis...
        P.token('('),  # ...open the parenthesis, ...
        :expr => P.first(:plusexpr, :minusexpr, :digits, :parens), #...parse something in it, ...
        P.token(')'),  # ...and close the parenthesis
    ),
    :plusexpr => P.seq(:expr, P.token('+'), :expr), # some extra math
    :minusexpr => P.seq(:expr, P.token('-'), :expr),
)

Then you “compile” the description to a fast, flat, integer-based grammar representation:

g = P.make_grammar([:expr],  P.flatten(rules, Char))

Now you can parse a string, getting a (very optimized and inhumane) bunch of possible parse trees:

p = P.parse(g, "123-((345)+567)")

In this bunch, you can find your match and traverse its corresponding parse tree, interpreting it in the process:

P.traverse_match(p,
    P.find_match_at!(p, :expr, 1),
    fold = (m, p, subvals) ->
        m.rule == :digits ? parse(Int, m.view) :
        m.rule == :expr ? subvals[1] :
        m.rule == :parens ? subvals[2] :
        m.rule == :plusexpr ? subvals[1] + subvals[3] :
        m.rule == :minusexpr ? subvals[1] - subvals[3] : nothing,
)

…which gets you the parsed expression evaluated (in this case getting -789).

This example can be vastly expanded – there’s also support for:

fast input scanners and pre-lexing (which, in this case, saves tons of unnecessary allocations in the parse tree, and you can squeeze out some extra speed),
various context-sensitive tools specific for PEGs, such as the “followed-by” token match,
(CompSci bonus: this allows the PEG grammars to describe even some context-sensitive languages, such as the canonical L=\{a^nb^nc^n | n\in\mathbb{N}\})
processing any kind of token sequence imaginable – if your sequence type supports integer-ish indexing with [] and the standard view, firstindex, lastindex, prevind and nextind, chances are that you can parse it with PikaParser.

Some hints of that are available in the documentation and tests.
Enjoy the parsing!
-mk

PS. I was wondering where to continue from here – this solved the problem for us, and I’m not really sure where to go next. I guess that some support for good and helpful parser error messages and error recovery would be cool. Having the declarative grammar clauses and their interpretation functions in a single structure (like with bison) might be useful too. Please feel free to share opinions.

PPS.: Coming next: Library for layout-aware pretty-printing.

melonedo · February 12, 2023, 1:47pm

Nice package! I just read about it in the Wikipedia page then Julia has a native solution of it.

NAS · June 25, 2023, 7:50pm

First off- Excellent package and, for me personally, great timing. I was looking for a way to go about parsing an uncommon data format and I think this will do just the trick.

I’m trying to figure out how to use this effectively for one part of my use case. I’m attempting to write a parser for the Zinc specification from Project Haystack: Zinc – Project Haystack.

I’m working on the <id> component and have an issue that I need to match strings of the pattern: <alphaLo> (<alphaLo> | <alphaHi> | <digit> | '_')* and the challenge is that if I have something like: "type,val" I will match "l", "la","val","e","pe","ype","type" which is essentially all partial strings that comply in addition to the full string.

The id rule portion is:

:id =>P.seq(
        :alphaLo,
        P.many(
            :inId => P.first(
                :alphaLo,
                :alphaHi,
                :digit,
                P.token('_'),
            ),
        ),
        P.not_followed_by(:inId),
    ),

I’m not sure if my matching rule needs to be different or if this is something I can handle when I traverse and fold. Any suggestions that might help point me in the right direction?

caleb-allen · June 26, 2023, 12:06pm

Oh my goodness, this package is a godsend! I am really excited to try it out.

I’ve been working on VimBindings.jl, and one of the messier parts is the code which parses vim commands. I started it by just using a list of regexes, but as new features have been added these regexes have become a little horrifying. Vim commands can get quite long, with complex actions being built up - it’s not as simple as 1 key = 1 action. Plus, it’s been a big mental block for me to add new features, like registers (which uses "x at the beginning of a command) because I’ll have to refactor all of the regexes.

Suffice it to say, I’d really like to refactor the parsing code, and this looks like a great candidate. Really approachable, too. Exactly what I need to make my package a lot more flexible and extensible!

lmiq · June 26, 2023, 1:29pm

Can this be used in principle to parse something with this kind of syntax?

atoms = [ 
    (name="C", mass="12.0", x=0.0, y=0.0, z=0.0), 
    (name="N", mass="14.0", x=1.0, y=1.0, z=1.0) 
]

selection = select(atoms, "name C N and x < 1.0")

Just a Yes/No will suffice, if Yes I’ll study how to do it . (I something of the sort implemented in a package, but still with limited functionality, and I’ve been willing to improve this for a long time, but until now that would require writing a complete parser from scratch, which I didn’t have the time to).

MirekKratochvil · August 5, 2023, 2:06pm

Hi all, sorry for late reply, I was having kinda holiday. Let me answer individually.

MirekKratochvil · August 5, 2023, 2:12pm

Can this be used in principle to parse something with this kind of syntax?

Short answer: Yes, totally!

Long answer: If I get it right, you want to just parse and evaluate the query? First make sure your grammar has clean distinction between keywords and non-keywords, i.e. for every weird string the enemy throws at you, you will know what to expect. Then it should be possible very easily, you’ll just rewrite your formal rules into a grammar and it will work. The common way is to make a data representation of the query, then format of the query to strings, and the parser is simply a reverse process of that. :]

In your case I’m not sure how the name C N should parse – I’d instead go for something less likely ambiguous such as (name C or name N) and x < 1.0, or maybe have proper quotes such as (name = 'C' or name = 'N') and x < 1.0.

Actually we parse a very similar language in COBREXA for gene-reaction relations (which are also kinda boolean expressions); you may get inspired at https://github.com/LCSB-BioCore/COBREXA.jl/blob/master/src/base/utils/gene_associations.jl#L86-L155 .

MirekKratochvil · August 5, 2023, 2:16pm

parsing by regexes

Ah great this helps! vimscript is not very complex so PikaParser should generally help; there might be a few dark areas (sometimes the parsing there is veeeeeeeery context specific) but in case of issues just ping me.

I’m certainly happy that this might prevent use of regexes for parsing, that was one of the wannabe original wishlist goals I had when writing PikaParser, and man, it’s so nice to see that it kinda works.

MirekKratochvil · August 5, 2023, 2:32pm

Zinc parsing

The matching of “partial” things is one of properties of packrat (and thus also pika) parsers, and formally there isn’t much help against that (many people tried and failed; you may have a look at the whole peg-packrat-lazypackrat-pika paper series on Scholar). As a theoretical property it’s certainly unpleasant.

BUT LUCKILY – we are programmers and we can just completely avoid tokenizing out the nonsential things. Normal languages (basically all computer languages and parsers) solve this problem by doing a first simpler pass that uses a “lexer” to get sensible chunks of the strings out, only then running the (relatively expensive) parser to actually make meaning of the simple tokens.

(The formal difference between parser and lexer is that the lexer is a regular language that can be parsed by a fast finite automaton (Chomsky’s type-3), and parser is a context-free or slightly context-sensitive language (requires pushdown automata and/or backtracking, corresponds usually to Chomsky type-2)).

PikaParser supports relatively good lexing, using the grammar rules you already have constructed – see the function parse_lex (here: Search · PikaParser.jl ). If you combine that with simple token scanners (in short, do not process the input using P.token but instead use P.scan), you should get something that avoids the intermediate parses completely, and gets MUCH faster as a side effect.

That said, I thought I have documented the parse_lex better, yet I did not. I should add a demo. Thanks for the reminder.

MirekKratochvil · August 5, 2023, 4:44pm

@NAS aaaaaaaaaanyway

https://lcsb-biocore.github.io/PikaParser.jl/dev/scheme_lex/

Happy parsing!