Writing a parser in Julia

Hello,
Has anyone written a parser in Julia for an arbitrary syntax (other than Julia’s)? Are there any packages available for lexing and parsing?

Thanks,

In terms of helper packages, there is Tokenize.jl which powers CSTParser.jl.

I asked over at ANTLR if Julia generation might be of interest; seems it might, but we’d probably have to volunteer the generator code – or at least participate. I’m interested, but it’s not something I’m working on right now:

https://github.com/antlr/antlr4/issues/2330

Less helpful, perhaps, but … I ended up hand-coding a tiny parser combinator framework for the parser I needed, and Julia turned out to be quite suited for the task :slight_smile: (The code isn’t quite release-worthy yet, I think, but I could send it to you if you want.)
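
For readers following along, here is roughly what a tiny parser-combinator framework can look like in Julia. This is only an illustrative sketch using Base, not the code mentioned above; every name in it (`satisfy`, `seq`, `many`, `many1`, `fmap`, `sum_expr`) is hypothetical. It assumes a parser is a function taking `(input, pos)` and returning `(value, next_pos)` on success or `nothing` on failure.

```julia
# A parser is a function (input, pos) -> (value, next_pos), or `nothing` on failure.

# Accept one character that satisfies the predicate `pred`.
function satisfy(pred)
    return function (input::AbstractString, pos::Int)
        pos > lastindex(input) && return nothing
        c = input[pos]
        return pred(c) ? (c, nextind(input, pos)) : nothing
    end
end

# Run `p`, then `q`; succeed with a tuple of both values.
function seq(p, q)
    return function (input, pos)
        r1 = p(input, pos)
        r1 === nothing && return nothing
        r2 = q(input, r1[2])
        r2 === nothing && return nothing
        return ((r1[1], r2[1]), r2[2])
    end
end

# Apply `p` zero or more times, collecting the values.
# Note: `p` must consume input on success, otherwise this loops forever.
function many(p)
    return function (input, pos)
        values = Any[]
        while true
            r = p(input, pos)
            r === nothing && break
            push!(values, r[1])
            pos = r[2]
        end
        return (values, pos)
    end
end

# Transform the value of a successful parse with `f`.
function fmap(f, p)
    return function (input, pos)
        r = p(input, pos)
        return r === nothing ? nothing : (f(r[1]), r[2])
    end
end

# One or more repetitions of `p`.
many1(p) = fmap(t -> pushfirst!(t[2], t[1]), seq(p, many(p)))

# Grammar: sum_expr = number ('+' number)*
digit    = satisfy(isdigit)
number   = fmap(cs -> parse(Int, join(cs)), many1(digit))
plus     = satisfy(==('+'))
term     = fmap(t -> t[2], seq(plus, number))          # keep the number, drop the '+'
sum_expr = fmap(t -> foldl(+, t[2]; init = t[1]), seq(number, many(term)))

println(sum_expr("12+30+8", 1))   # (50, 8)
println(sum_expr("abc", 1))       # nothing
```

Closures and first-class functions keep each combinator to a few lines. A more serious version would probably want concrete result types rather than `Any[]`, plus error reporting.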

@mlhetland
Thanks Magnus,
Yes, I agree. Porting ANTLR is going to be a major exercise. I am still learning Julia, so I don’t mind writing a small lexer/parser by hand. Seeing an example would really help, and I would love to see the code you mentioned.

Best,

This one is for the Reduce language

CorpusLoaders.jl contains a parser for the SemCor format.

WordTokenizers.jl contains many tokenizers (lexers) for natural language.

I’d love to get ParserCombinators.jl working again.

I think it might be better to start over with the new language features, though. I don’t know that it was performant in 0.5, and just converting it isn’t going to solve that.

This might be useful:

For lexing, Automa.jl may do good work.

Thanks everyone for the valuable advice and to Magnus for the code sample. Much appreciated.

I see that Automa reads text bytewise; to what extent is Unicode (e.g., UTF-8) supported? It seems Unicode character classes (as in PCRE) are not supported. Might it be possible to add them, at least to the extent that membership is easy to check (using functions like isletter or ispunct)?

Unicode character classes are not supported in Automa.jl. Technically speaking, it is possible to support encoded Unicode data like UTF-8, but it will need some fundamental changes to the code generation process of Automa.jl. Automa.jl changes the current state by reading a single byte from a data stream, and hence functions like isletter or ispunct are not workable here since these functions may need to consume multiple bytes from the stream.
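
To make the byte-versus-character mismatch concrete, here is a small sketch using only Base Julia. It does not use Automa.jl at all, and `utf8_seq_length` is a hypothetical helper written just for this illustration. It shows that a predicate like `isletter` gives the right answer only for a fully decoded `Char`, while a machine stepping one byte at a time sees the pieces of a multi-byte sequence separately.

```julia
s = "\u00e9"                 # 'é' (U+00E9), a single character

bytes = codeunits(s)         # the raw UTF-8 bytes a byte-wise machine would read
println(length(s))           # 1 character
println(length(bytes))       # 2 bytes: 0xc3 0xa9

# On the decoded character, the predicate answers the intended question.
println(isletter(s[1]))      # true

# Classifying each byte on its own asks a different question:
# Char(0xc3) is 'Ã' and Char(0xa9) is '©', neither of which is 'é'.
println(isletter(Char(bytes[1])))   # true, but for the wrong character
println(isletter(Char(bytes[2])))   # false

# Hypothetical helper: length of a UTF-8 sequence judged from its lead byte.
# A byte-wise machine would need this kind of lookahead before it could
# hand a complete character to isletter or ispunct.
function utf8_seq_length(lead::UInt8)
    lead < 0x80 && return 1              # 0xxxxxxx: ASCII
    (lead & 0xe0) == 0xc0 && return 2    # 110xxxxx
    (lead & 0xf0) == 0xe0 && return 3    # 1110xxxx
    (lead & 0xf8) == 0xf0 && return 4    # 11110xxx
    return 0                             # continuation byte or invalid lead byte
end

println(utf8_seq_length(bytes[1]))   # 2
```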