Whitespace issue with Gumbo.jl

mlhetland · December 19, 2018, 2:33pm

I realize this probably isn’t an issue with Gumbo.jl but rather with Gumbo itself (unless there’s some switch that isn’t exposed), or maybe even with the HTML5 parsing algorithm, but I thought I’d see if someone has a not-too-hacky solution. (A hacky solution wouldn’t be hard.)

The issue is simply that text like foo<em> </em>bar loses info about the whitespace, so if I extract the text, I end up with foobar, though browsers render this as foo bar, which makes sense to me (although the original markup does not make sense; but it’s not mine).

So … is there any way to tell Gumbo to preserve whitespace, for example, so I can check if there actually is space in there or not?

Nosferican · December 19, 2018, 5:09pm

Can you provide a minimal reproducible example?

mlhetland · December 19, 2018, 6:35pm

Sure:

julia> using Gumbo
julia> doc = parsehtml("foo<em> </em>bar")
HTML Document:
<!DOCTYPE >
<HTML>
  <head></head>
  <body>
    foo
    <em></em>
    bar
  </body>
</HTML>

So, for example:

julia> using Cascadia
julia> nodeText(doc.root)
"foobar"

What I would have liked here is "foo bar".

Nosferican · December 19, 2018, 7:05pm

Aye. That does seem suboptimal. As a workaround, maybe:

# Replace whitespace-only runs between tags with a placeholder before parsing
using Gumbo
doc = parsehtml(replace("foo<em> </em>bar", r"(?<=>)\s+(?=<)" => "_"))

Maybe open an issue at the repository.
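
To make the placeholder idea concrete, here is a minimal sketch of the full round trip: protect the whitespace, extract the text, then restore the spaces. The helper names `protect_whitespace` and `restore_whitespace` and the choice of placeholder character are assumptions made for illustration; they are not part of Gumbo.jl or Cascadia. To keep the sketch free of dependencies, the parsing step is replaced by a crude tag-stripping regex.

```julia
# Hypothetical helpers; neither function is part of Gumbo.jl or Cascadia.
const PLACEHOLDER = "\ue000"  # private-use code point, assumed absent from the input

# Swap whitespace-only runs between a '>' and a '<' for the placeholder,
# so that the parser sees a non-whitespace text node there.
protect_whitespace(html::AbstractString) =
    replace(html, r"(?<=>)\s+(?=<)" => PLACEHOLDER)

# After extracting the text, turn each placeholder back into a single space.
restore_whitespace(text::AbstractString) =
    replace(text, PLACEHOLDER => " ")

protected = protect_whitespace("foo<em> </em>bar")

# With Gumbo and Cascadia loaded, the middle step would be something like
#   text = nodeText(parsehtml(protected).root)
# Here the tags are stripped with a crude regex instead (illustration only).
text = replace(protected, r"<[^>]*>" => "")

println(restore_whitespace(text))
```

Note that the regex matches any whitespace-only run between two tags, so line breaks and indentation between block-level elements would also come back as spaces. Whether that is acceptable depends on the markup being scraped.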

mlhetland · December 20, 2018, 2:56pm

Yeah, I thought about inserting a placeholder of some sort. It just seemed hackish. But, yeah, I’ll open up an issue and see. I suspect this is just a consequence of how the original Gumbo C library behaves, and it hasn’t been updated for years, it seems, so…

mlhetland · December 20, 2018, 3:06pm

Opened an issue: “Tagged whitespace disappears” (Issue #64, JuliaWeb/Gumbo.jl on GitHub)
