Whitespace issue with Gumbo.jl



I realize this probably isn’t an issue with Gumbo.jl but rather with Gumbo itself (unless there’s some switch or something that’s not exposed) – or maybe even the HTML5 parsing algorithm – but I thought I’d see if someone has a not all-too-hacky solution. (A hacky solution won’t be so hard :))

The issue is simply that markup like foo<em> </em>bar loses the whitespace inside the <em> element when parsed, so if I extract the text, I end up with foobar. Browsers render this as foo bar, which makes sense to me (although the original markup does not make sense; but it’s not mine).

So … is there any way to tell Gumbo to preserve whitespace, for example, so I can check if there actually is space in there or not?


Can you provide a minimal reproducible example?



julia> using Gumbo
julia> doc = parsehtml("foo<em> </em>bar")
HTML Document:

So, for example:

julia> using Cascadia
julia> nodeText(doc.root)

What I would have liked here is "foo bar".


Aye. That does seem suboptimal. As a workaround, maybe:

# Replace each whitespace-only run between two tags with a single placeholder
doc = parsehtml(replace("foo<em> </em>bar", r"(?<=>)\s+(?=<)" => "_"))
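A slightly fuller sketch of the same placeholder idea, in case it helps. The names PLACEHOLDER, protect_whitespace and restore_whitespace are made up for illustration and are not part of Gumbo.jl or Cascadia. It assumes the placeholder character never occurs in the real text:

```julia
# Hypothetical placeholder (U+2420, the "symbol for space"); any string that
# cannot occur in the real document would do.
const PLACEHOLDER = "\u2420"

# Swap each whitespace-only run that sits directly between two tags
# for the placeholder, so the parser sees a non-whitespace text node.
protect_whitespace(html::AbstractString) =
    replace(html, r"(?<=>)\s+(?=<)" => PLACEHOLDER)

# After extracting the text, turn the placeholder back into one space.
restore_whitespace(text::AbstractString) =
    replace(text, PLACEHOLDER => " ")

protected = protect_whitespace("foo<em> </em>bar")
println(protected)  # foo<em>␠</em>bar

# With Gumbo and Cascadia loaded, the round trip would then be roughly:
#   restore_whitespace(nodeText(parsehtml(protected).root))
```

Note that the pattern also matches whitespace between block-level tags (for example a newline between </p> and <p>), so the extracted text may gain spaces there too. Whether that matters depends on what you do with the text afterwards.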

Maybe open an issue at the repository.


Yeah, I thought about inserting a placeholder of some sort. It just seemed hackish. But, yeah, I’ll open up an issue and see. I suspect this is just a consequence of how the original Gumbo C library behaves, and it hasn’t been updated for years, it seems, so…


Opened an issue: https://github.com/JuliaWeb/Gumbo.jl/issues/64