(The future of) HTML Parsing in Julia

Hello world,

I am writing here hoping to find someone more knowledgeable than I am about the Julia ecosystem around web / HTML.

My context in a few words: I am working on a larger project that involves a lot of HTML manipulation (reliable parsing is paramount). I started the project with the Gumbo.jl package, and I realized a while back that its C counterpart has been archived. Since Gumbo.jl is a wrapper around the C library, I think it is safe to assume that it will stay frozen and will not pick up changes or additions to the HTML standard. Continuing to build on Gumbo.jl seems risky at this point.

I also looked at the EzXML.jl package, but that one seems unmaintained as well (some community pull requests are sitting there unmerged).

Don’t get me wrong: I do not want people to serve me “the parser” on a plate - I want to understand the direction of web programming in Julia - and an HTML parser seems to me like a kind of cornerstone. I am also willing to contribute, but I don’t want to bet on the wrong horse here: maybe there is a direction that has a better chance of succeeding in the long run, and having just another small Julia package without documentation or contributors is not going to cut it.

Maybe some of the maintainers of Genie.jl can help with this. I understand that Genie.jl has some internals that deal with HTML parsing, but that functionality seems tailored to templating and the MVC approach. One way to get a standalone HTML parser might be to split the parsing code out of Genie.jl into its own package.

In any case, can anybody point me in the right direction here?

There are a few other packages that you didn’t list that could be helpful.

Franklin.jl (tlienart/Franklin.jl), a static site generator written in Julia, must also have some capabilities in this area, I'm guessing.

Thanks for pointing out the additional materials.

I was looking specifically for the HTML parsing functionality - although HypertextLiteral.jl looks really great for generating HTML on the fly (and I wasn’t aware of the package).

I was aware of Cobweb.jl, but I actually needed to dive into the source code to find the parser (the documentation doesn’t say anything about it).

I just tested it, and it seems to be a feasible and easy-to-extend/maintain alternative to the Gumbo.jl project.