Writing a parser in Julia

Drgo · August 20, 2018, 3:57pm

Hello,
Has anyone written a parser in Julia for an arbitrary syntax (other than Julia’s)? Any packages available for lexing and parsing?

Thanks,

dawbarton · August 20, 2018, 4:15pm

In terms of helper packages, there is Tokenize.jl which powers CSTParser.jl.

mlhetland · August 20, 2018, 5:19pm

I asked over at ANTLR if Julia generation might be of interest; seems it might, but we’d probably have to volunteer the generator code – or at least participate. I’m interested, but it’s not something I’m working on right now:

https://github.com/antlr/antlr4/issues/2330

Less helpful, perhaps, but … I ended up hand-coding a tiny parser combinator framework for the parser I needed, and Julia turned out to be quite well suited to the task. (The code isn’t quite release-worthy yet, I think, but I could send it to you if you want.)
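For readers unfamiliar with the idea, here is a rough sketch of what a tiny parser combinator setup can look like in Julia. This is not the framework mentioned above (that code is not shown in the thread). Every name here (`satisfy`, `char`, `seq`, `alt`, `many`, `many1`, `mapval`) is made up for illustration, and the convention that a parser is a function `(input, position) -> (value, new_position)` or `nothing` is just one possible design.

```julia
# A parser is any function (s::String, i::Int) -> (value, next_index) or `nothing`.
# All names below are hypothetical; this is an illustrative sketch only.

# Match one character for which `pred` returns true.
function satisfy(pred)
    return (s, i) -> begin
        i <= ncodeunits(s) || return nothing
        c = s[i]
        pred(c) ? (c, nextind(s, i)) : nothing
    end
end

# Match one specific character.
char(c) = satisfy(x -> x == c)

# Run parsers one after another and collect their values.
function seq(parsers...)
    return (s, i) -> begin
        vals = Any[]
        for p in parsers
            r = p(s, i)
            r === nothing && return nothing
            push!(vals, r[1])
            i = r[2]
        end
        (vals, i)
    end
end

# Try alternatives in order; the first one that succeeds wins.
function alt(parsers...)
    return (s, i) -> begin
        for p in parsers
            r = p(s, i)
            r === nothing || return r
        end
        nothing
    end
end

# Apply `p` zero or more times.
function many(p)
    return (s, i) -> begin
        vals = Any[]
        while true
            r = p(s, i)
            r === nothing && break
            push!(vals, r[1])
            i = r[2]
        end
        (vals, i)
    end
end

# Transform the value produced by a successful parse.
mapval(f, p) = (s, i) -> begin
    r = p(s, i)
    r === nothing ? nothing : (f(r[1]), r[2])
end

# Apply `p` one or more times.
many1(p) = mapval(v -> pushfirst!(v[2], v[1]), seq(p, many(p)))

# Grammar: expr := number (('+' | '-') number)*
number = mapval(cs -> parse(Int, join(cs)), many1(satisfy(isdigit)))
addop  = alt(char('+'), char('-'))
expr   = mapval(seq(number, many(seq(addop, number)))) do v
    foldl((acc, pair) -> pair[1] == '+' ? acc + pair[2] : acc - pair[2],
          v[2]; init = v[1])
end

expr("10-3+5", 1)   # (12, 7): the value and the index just past the match
```

Closures keep the whole thing short; a more serious version would probably want concrete parser types, error positions, and whitespace handling.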

Drgo · August 20, 2018, 7:38pm

@mlhetland
Thanks Magnus,
Yes, I agree. Porting ANTLR is going to be a major exercise. I am still learning Julia, so I do not mind writing a small lexer/parser by hand. Seeing an example would really help. I would love to see the code you mentioned.

Best,
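As a starting point for the "small lexer/parser by hand" route, a sketch along the following lines may help. It is not code from anyone in this thread. The toy arithmetic grammar and all names (`Token`, `lex`, `Parser`, `current`, `advance!`, `parse_binary`, `evaluate`) are assumptions chosen for illustration, and it evaluates the expression directly instead of building a syntax tree.

```julia
# Hand-written lexer plus recursive-descent parser for a toy grammar
# (hypothetical example):
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'

struct Token
    kind::Symbol      # :num, :op, :lparen or :rparen
    text::String
end

# Split the source string into tokens, skipping whitespace.
function lex(src::AbstractString)
    tokens = Token[]
    i = firstindex(src)
    while i <= lastindex(src)
        c = src[i]
        if isspace(c)
            i = nextind(src, i)
        elseif isdigit(c)
            j = i
            while j <= lastindex(src) && isdigit(src[j])
                j = nextind(src, j)
            end
            push!(tokens, Token(:num, src[i:prevind(src, j)]))
            i = j
        elseif c in "+-*/"
            push!(tokens, Token(:op, string(c)))
            i = nextind(src, i)
        elseif c == '(' || c == ')'
            push!(tokens, Token(c == '(' ? :lparen : :rparen, string(c)))
            i = nextind(src, i)
        else
            error("unexpected character '$c' at index $i")
        end
    end
    return tokens
end

# Parser state: the token list and a cursor into it.
mutable struct Parser
    tokens::Vector{Token}
    pos::Int
end

current(p::Parser) = p.pos <= length(p.tokens) ? p.tokens[p.pos] : nothing
advance!(p::Parser) = (p.pos += 1; nothing)

# Left-associative chain: operand (op operand)*
function parse_binary(p::Parser, operand, ops)
    v = operand(p)
    while true
        t = current(p)
        (t !== nothing && t.kind == :op && haskey(ops, t.text)) || break
        advance!(p)
        v = ops[t.text](v, operand(p))
    end
    return v
end

parse_expr(p::Parser) = parse_binary(p, parse_term, Dict("+" => (+), "-" => (-)))
parse_term(p::Parser) = parse_binary(p, parse_factor, Dict("*" => (*), "/" => (/)))

function parse_factor(p::Parser)
    t = current(p)
    t === nothing && error("unexpected end of input")
    if t.kind == :num
        advance!(p)
        return parse(Int, t.text)
    elseif t.kind == :lparen
        advance!(p)
        v = parse_expr(p)
        c = current(p)
        (c !== nothing && c.kind == :rparen) || error("expected ')'")
        advance!(p)
        return v
    end
    error("unexpected token '$(t.text)'")
end

# Lex, parse and evaluate a whole string.
function evaluate(src)
    p = Parser(lex(src), 1)
    v = parse_expr(p)
    current(p) === nothing || error("trailing input after expression")
    return v
end

evaluate("2 * (3 + 4) - 5")   # 9
```

Each grammar rule maps to one function, which is what makes this style easy to write and debug by hand.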

chakravala · August 21, 2018, 6:33am

This one is for the Reduce language

oxinabox · August 21, 2018, 8:18am

CorpusLoaders.jl contains a parser for the SemCor format:

github.com/JuliaText/CorpusLoaders.jl/blob/master/src/SemCor.jl

struct SemCor{S}
    filepaths::Vector{S}
end

function init_datadeps(::Type{SemCor})
    for (ver, checksum) in [("1.6", "16814254fe194d55a2fcc24858aa76d71de3c49e495bd98478cc7345e766d8b7"),
                ("1.7", "0495577ac3a87c2a64fe6189798ea046de0f44943dfb7b60fe38cf648d34c421"),
                ("1.7.1", "70b9eb7ca0dc9d67655f9d671d40be10aeff490f0bea4f10cb1946127b74c102"),
                ("2.0", "93fbae725f0125dedb7369403fda1dace85b2dcd8a523ed80af23e863b18ef2c"),
                ("2.1", "0714f07dbcb84a215d668f3ee85892fa8fa4a8154439662eb7529413367b8f56"),
                ("3.0", "a8000014d6fc864f8bd9d83c62be601151cadd617c6554a39a1ad38b4b3f017b")]

        register(DataDep("SemCor $ver",
            """
            Website: http://web.eecs.umich.edu/%7Emihalcea/downloads.html#semcor
            Orignal Author: George A. Miller et al.
            Maintainer: Rada Mihalcea
            For WordNet version $ver

            This is SemCor orginally developed along side WordNet.

This file has been truncated.

WordTokenizers.jl contains many tokenizers (lexers) for natural language.

I’d love to get ParserCombinator.jl working again.

github.com/andrewcooke/ParserCombinator.jl

Make work in 0.6 (andrewcooke:master ← oxinabox:oxff6), opened 12:02PM - 26 Nov 17 UTC by oxinabox, +333 -347

Ok, this is building on #26 and #21. I'm not saying this is the prettiest code…, but it passes all tests.

In a few places I got rid of inner constructors and replaced them with outer constructors. They are much more sensibly behaved, being just functions. I'd like to do that everywhere, but for now I am happy just to have it working.

I also drop all support for 0.5 and remove Compat. It's just easier to stop maintaining old versions (at least until 1.0), in my opinion, particularly given that this package is stable, so the version-to-version difference is likely just deprecation fixes.

I think it might be better to start again with the new language features, though, as I don’t know that it was performant in 0.5, and just converting it isn’t going to solve that.

mlhetland · August 21, 2018, 8:25am

This might be useful:

bicycle1885 · August 21, 2018, 9:18am

For lexing, Automa.jl may work well.

Drgo · August 21, 2018, 3:01pm

Thanks everyone for the valuable advice and to Magnus for the code sample. Much appreciated.

mlhetland · August 29, 2018, 5:03pm

I see that Automa reads text bytewise; to what extent is Unicode (e.g., UTF-8) supported? It seems Unicode character classes (as in PCRE) are not supported; might it be possible to add that, at least to the extent that membership is easy to check (using things like isletter or ispunct or the like)?

bicycle1885 · August 30, 2018, 2:13am

Unicode character classes are not supported in Automa.jl. Technically speaking, it is possible to support encoded Unicode data like UTF-8, but it will need some fundamental changes to the code generation process of Automa.jl. Automa.jl changes the current state by reading a single byte from a data stream, and hence functions like isletter or ispunct are not workable here since these functions may need to consume multiple bytes from the stream.
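The byte-versus-character mismatch described here can be seen with Base Julia alone. The sketch below does not use Automa.jl or show anything about its internals. It only illustrates why a predicate such as `isletter`, which takes a whole `Char`, does not fit a machine that advances one byte at a time. The helper `utf8_seqlen` is a hypothetical name, and it only inspects the lead byte without validating the sequence.

```julia
s = "n\u00e9"                 # "né": 'é' (U+00E9) takes two bytes in UTF-8

bytes = codeunits(s)          # what a byte-wise state machine consumes
chars = collect(s)            # what predicates like isletter operate on

length(bytes)                 # 3
length(chars)                 # 2

# On whole characters the predicate behaves as expected.
all(isletter, chars)          # true

# Reinterpreting each byte as its own code point splits 'é' into
# 'Ã' (0xc3) and '©' (0xa9), and '©' is not a letter.
all(isletter, Char.(bytes))   # false

# A byte-wise machine would first have to know how many bytes belong together.
# Hypothetical helper: length of the UTF-8 sequence starting with lead byte `b`
# (0 for a continuation byte or a byte that cannot start a sequence).
function utf8_seqlen(b::UInt8)
    b < 0x80 && return 1      # 0xxxxxxx: ASCII
    b < 0xc0 && return 0      # 10xxxxxx: continuation byte
    b < 0xe0 && return 2      # 110xxxxx
    b < 0xf0 && return 3      # 1110xxxx
    b < 0xf8 && return 4      # 11110xxx
    return 0
end

utf8_seqlen(bytes[1])         # 1 ('n')
utf8_seqlen(bytes[2])         # 2 (lead byte of 'é')
utf8_seqlen(bytes[3])         # 0 (continuation byte)
```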
