Whitespace issue with Gumbo.jl

question

#1

I realize this probably isn’t an issue with Gumbo.jl but rather with Gumbo itself (unless there’s some switch or something that’s not exposed) – or maybe even the HTML5 parsing algorithm – but I thought I’d see if someone has a not all-too-hacky solution. (A hacky solution wouldn’t be so hard 🙂)

The issue is simply that text like foo<em> </em>bar loses info about the whitespace, so if I extract the text, I end up with foobar, though browsers render this as foo bar, which makes sense to me (although the original markup does not make sense; but it’s not mine).

So … is there any way to tell Gumbo to preserve whitespace, for example, so I can check if there actually is space in there or not?


#2

Can you provide a minimal reproducible example?


#3

Sure:

julia> using Gumbo
julia> doc = parsehtml("foo<em> </em>bar")
HTML Document:
<!DOCTYPE >
<HTML>
  <head></head>
  <body>
    foo
    <em></em>
    bar
  </body>
</HTML>

So, for example:

julia> using Cascadia
julia> nodeText(doc.root)
"foobar"

What I would have liked here is "foo bar".


#4

Aye. That does seem suboptimal. As a workaround, maybe something like this:

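# Replace each whitespace-only run between a closing '>' and an opening '<'
# with a placeholder ("_" here), so the parser sees non-whitespace text there.
doc = parsehtml(replace("foo<em> </em>bar", r"(?<=>)\s+(?=<)" => "_"))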

Maybe open an issue at the repository.


#5

Yeah, I thought about inserting a placeholder of some sort. It just seemed hackish. But, yeah, I’ll open up an issue and see. I suspect this is just a consequence of how the original Gumbo C library behaves, and it hasn’t been updated for years, it seems, so…
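
For reference, the placeholder idea could be sketched roughly as below. This is only a sketch under assumptions, not something Gumbo.jl provides: `WS_MARKER`, `mark_whitespace`, and `text_with_spaces` are hypothetical names, and the marker is a private-use character that is assumed never to occur in the input.

```julia
using Gumbo, Cascadia

# Hypothetical marker: a private-use code point assumed not to appear in the input.
const WS_MARKER = "\ue000"

# Replace each whitespace-only run between a '>' and a '<' with the marker,
# so the parser sees a non-whitespace text node in that position.
mark_whitespace(html) = replace(html, r"(?<=>)\s+(?=<)" => WS_MARKER)

# Parse the marked HTML, extract the text, and turn each marker back into a space.
function text_with_spaces(html)
    doc = parsehtml(mark_whitespace(html))
    return replace(nodeText(doc.root), WS_MARKER => " ")
end

text_with_spaces("foo<em> </em>bar")
```

The catch is that the regex knows nothing about HTML: it also marks whitespace between block-level tags (for example a newline between `</p>` and `<p>`), and it would touch anything inside comments or scripts that happens to match, so the extracted text may gain spaces that a browser would not render.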


#6

Opened an issue: https://github.com/JuliaWeb/Gumbo.jl/issues/64