charset=Windows-1250, Gumbo, parsing HTML: how to keep the original text

programista · February 15, 2018, 4:23pm

I am using this code to parse HTML pages:
```julia
using Gumbo
using AbstractTrees

url = "http://rp.pl"

# Download the page and parse the raw bytes as if they were UTF-8.
getpage(url) = parsehtml(String(read(download(url))))

# Join the text of every HTMLText leaf in the tree.
text_only(doc::HTMLDocument) = text_only(doc.root)
text_only(frag) = join([text(leaf) for leaf in Leaves(frag) if leaf isa HTMLText], " ")
get_page_text(url) = text_only(getpage(url))

doc = getpage(url);
# doc.root[2] is the <body>; renamed from `only`, which shadows Base.only.
words = sort(split(text_only(doc.root[2])))
```
When the page uses charset=Windows-1250, this code loses all national characters, e.g. życie => �ycie.

How can I handle pages with charset=Windows-1250, e.g. url="http://rp.pl"?
Paul

stevengj · February 15, 2018, 6:04pm

Julia’s String type expects UTF-8 encoded text. Non-ASCII Windows-1250 text will show up as mojibake or invalid UTF-8.
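
A small illustration of what that means, using only Base and assuming Julia 1.x. The byte values are the Windows-1250 encoding of "życie", where ż is the single byte 0xBF:

```julia
# "życie" as Windows-1250 bytes: ż is the single byte 0xBF in that code page.
bytes = UInt8[0xbf, 0x79, 0x63, 0x69, 0x65]

# String takes ownership of the vector it is given, so pass a copy.
s = String(copy(bytes))

isvalid(s)               # false: a lone 0xBF is not valid UTF-8
codeunits(s) == bytes    # true: the raw bytes are still there, unchanged
```

So nothing is destroyed by calling String; the bytes are just interpreted under the wrong encoding when displayed.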

If you have Windows-1250 text (or other encodings) then you have a few options:

- Convert to UTF-8. (A package like https://github.com/nalimilan/StringEncodings.jl can help here, although for the specific case of Windows-1250 you can probably make something more efficient if needed.)
- Use String on the Windows-1250 text as-is. Non-ASCII characters will not be displayed correctly, but the underlying data will be preserved (at least in Julia 0.7), and parsing HTML (which only looks at the ASCII characters, which are the same in Windows-1250 and UTF-8) should work. If you are just sending the non-ASCII data someplace else that expects Windows-1250, and you aren’t processing the non-ASCII text yourself, you can just pass it blindly through like this.
- Define a new Windows1250String type that represents Windows-1250 data directly; this is a lot of work and is probably not worth it.
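
As a rough sketch of the "make something yourself" variant of the first option: a single-byte encoding can be converted with a lookup table. The table and the function name `decode_cp1250_polish` below are hypothetical and deliberately partial. They cover only the Polish letters, and every other non-ASCII byte is replaced with U+FFFD, so a real converter would need the full code page.

```julia
# Hypothetical helper: partial Windows-1250 -> Unicode table, Polish letters only.
const CP1250_POLISH = Dict{UInt8,Char}(
    0xa5 => 'Ą', 0xb9 => 'ą', 0xc6 => 'Ć', 0xe6 => 'ć',
    0xca => 'Ę', 0xea => 'ę', 0xa3 => 'Ł', 0xb3 => 'ł',
    0xd1 => 'Ń', 0xf1 => 'ń', 0xd3 => 'Ó', 0xf3 => 'ó',
    0x8c => 'Ś', 0x9c => 'ś', 0x8f => 'Ź', 0x9f => 'ź',
    0xaf => 'Ż', 0xbf => 'ż',
)

function decode_cp1250_polish(bytes::AbstractVector{UInt8})
    io = IOBuffer()
    for b in bytes
        if b < 0x80
            write(io, b)    # ASCII bytes are identical in both encodings
        else
            # print writes the Char as UTF-8; unknown bytes become U+FFFD
            print(io, get(CP1250_POLISH, b, '\ufffd'))
        end
    end
    return String(take!(io))
end

decode_cp1250_polish(UInt8[0xbf, 0x79, 0x63, 0x69, 0x65])  # "życie"
```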

Generally, I would recommend just converting the data to UTF-8, especially if the conversion cost is not performance-critical.
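
A minimal sketch of that conversion applied to the code from the question, assuming StringEncodings.jl is installed and that its `decode(bytes, encoding)` accepts the iconv name "WINDOWS-1250". The helper name `getpage_decoded` and its `charset` keyword are made up for illustration, and the charset is supplied by hand rather than detected from the HTTP headers or the `<meta>` tag:

```julia
using Gumbo
using StringEncodings   # provides decode(bytes, encoding)

# Hypothetical helper: fetch a page, convert it to UTF-8, then parse it.
function getpage_decoded(url; charset = "WINDOWS-1250")
    bytes = read(download(url))      # raw bytes, not yet interpreted as text
    html = decode(bytes, charset)    # a valid UTF-8 String from here on
    return parsehtml(html)
end

# The conversion step on its own:
decode(UInt8[0xbf, 0x79, 0x63, 0x69, 0x65], "WINDOWS-1250")  # "życie"
```

The rest of the original code (text_only and so on) can then be used unchanged on the returned document.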
