(The future of) HTML Parsing in Julia

algunion · June 14, 2023, 3:02pm

Hello world,

I am writing here in the hope of finding someone more knowledgeable than me about the Julia ecosystem around web / HTML.

My context in a few words: I am working on a larger project that involves a lot of HTML manipulation (reliable parsing is paramount). I started the project with the Gumbo.jl package, and I realized a while back that its C counterpart has been archived. Since Gumbo.jl is a wrapper around the C library, I think it is safe to assume that it will stay frozen and will not gain new features to track changes or additions to the HTML standard. Continuing to build on Gumbo.jl seems risky at this point.

I was also looking at the EzXML.jl package - but that one also seems to be unmaintained (some community pull requests are sitting there unmerged).

Don’t get me wrong: I do not want people to serve me “the parser” on a plate - I want to understand the direction of web programming in Julia - and an HTML parser seems to me like a kind of cornerstone. I am also willing to contribute, but I don’t want to bet on the wrong horse here: maybe there is a direction that has a better chance of succeeding in the long run, and having just another small Julia package without documentation or contributors is not going to cut it.

Maybe some of the maintainers of Genie.jl can help with this. I understand that Genie.jl has some internals that deal with HTML parsing - but it seems that the functionality is tailored to templating and the MVC approach. Maybe one way to get a standalone HTML parser is to split the parsing out of Genie.jl into its own package.

In any case, can anybody point me in the right direction here?

tbeason · June 14, 2023, 3:10pm

There are a few other packages that you didn’t list that could be helpful.

Franklin.jl (tlienart/Franklin.jl on GitHub), a static site generator, must also have some capabilities in this area, I’m guessing.

algunion · June 15, 2023, 3:10am

Thanks for pointing out the additional materials.

I was looking specifically for the HTML parsing functionality - although HypertextLiteral.jl looks really great for generating HTML on the fly (and I wasn’t aware of the package).

algunion · June 15, 2023, 3:41am

I was aware of Cobweb.jl - but I actually had to dive into the source code to find the parser (the documentation doesn’t say anything about it).

I just tested it - it seems to be a feasible and easy-to-extend/maintain alternative to the Gumbo.jl project.
