I realize this probably isn’t an issue with Gumbo.jl but rather with Gumbo itself (unless there’s some switch that isn’t exposed), or maybe even with the HTML5 parsing algorithm, but I thought I’d see if someone has a not-too-hacky solution. (A hacky solution wouldn’t be that hard.)
The issue is simply that text like foo<em> </em>bar loses info about the whitespace, so if I extract the text, I end up with foobar, though browsers render this as foo bar, which makes sense to me (although the original markup does not make sense; but it’s not mine).
So … is there any way to tell Gumbo to preserve whitespace, for example, so I can check if there actually is space in there or not?
Can you provide a minimal reproducible example?
Sure:
julia> using Gumbo
julia> doc = parsehtml("foo<em> </em>bar")
HTML Document:
<!DOCTYPE >
<HTML>
  <head></head>
  <body>
    foo
    <em></em>
    bar
  </body>
</HTML>
So, for example:
julia> using Cascadia
julia> nodeText(doc.root)
"foobar"
What I would have liked here is "foo bar".
Aye. That does seem suboptimal. As a workaround, maybe:
# Replace whitespace-only text between tags with a placeholder ("_")
doc = parsehtml(replace("foo<em> </em>bar", r"(?<=>)\s+(?=<)" => "_"))
Maybe open an issue at the repository.
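For what it’s worth, the placeholder idea can be wrapped into a pair of small helpers so the marker gets turned back into a space after extraction. This is only a sketch: the names WS_MARK, mark_whitespace and unmark_whitespace are made up for illustration, and it assumes the chosen private-use character never appears in the real input.

```julia
# Private-use character standing in for whitespace-only text between tags.
# Assumption: this character never occurs in the actual HTML.
const WS_MARK = '\ue000'

# Replace each whitespace run that sits directly between '>' and '<'
# with one marker, so the parser sees a non-blank text node there.
mark_whitespace(html::AbstractString) =
    replace(html, r"(?<=>)\s+(?=<)" => string(WS_MARK))

# Turn the markers back into ordinary spaces after text extraction.
unmark_whitespace(text::AbstractString) = replace(text, WS_MARK => ' ')

# Intended use with the packages from this thread (not run here):
#   using Gumbo, Cascadia
#   doc = parsehtml(mark_whitespace("foo<em> </em>bar"))
#   unmark_whitespace(nodeText(doc.root))
```

Note that this is blunt: it also marks whitespace between block-level tags (for example a newline between </p> and <p>) and inside <pre>, and it collapses every such run to a single space. That is probably fine for plain text extraction, but it isn’t a faithful model of how browsers treat whitespace.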
Yeah, I thought about inserting a placeholder of some sort. It just seemed hackish. But, yeah, I’ll open up an issue and see. I suspect this is just a consequence of how the original Gumbo C library behaves, and it hasn’t been updated for years, it seems, so…