charset=Windows-1250, Gumbo, parsing HTML: how to keep the original text


#1

I am using this code to parse HTML pages:
'
using Gumbo
using AbstractTrees
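url = "http://rp.pl"
# Download the page and parse it; String() assumes the bytes are UTF-8
getpage(url) = parsehtml(String(read(download(url))))
text_only(doc::HTMLDocument) = text_only(doc.root)
# Join the text of every HTMLText leaf in the tree
text_only(frag) = join([text(leaf) for leaf in Leaves(frag) if leaf isa HTMLText], " ")
get_page_text(url) = text_only(getpage(url))
doc = getpage(url);
# doc.root[2] is the <body> element; split its text into sorted words
only = sort(split(text_only(doc.root[2])))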
'
When the page uses charset=Windows-1250, this code loses all national characters, e.g. życie => �ycie.

How can I read pages with charset=Windows-1250, e.g. url="http://rp.pl"?
Paul


#2

Julia’s String type expects UTF-8 encoded text. Non-ASCII Windows-1250 text will show up as mojibake or invalid UTF-8.
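To see what that means in practice, here is a small sketch using only Base. The byte values are my own illustration (the word "życie" from the question, encoded as Windows-1250), not data taken from the actual page:

```julia
# Bytes of "życie" in Windows-1250: 0xBF is 'ż', the rest is plain ASCII.
cp1250_bytes = UInt8[0xbf, 0x79, 0x63, 0x69, 0x65]

# String() takes ownership of the vector it is given, so pass a copy.
s = String(copy(cp1250_bytes))

isvalid(s)                     # false: a lone 0xBF byte is not valid UTF-8
codeunits(s) == cp1250_bytes   # true: the original bytes are still stored
```

So the string displays wrongly, but nothing has been thrown away at this point.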

If you have Windows-1250 text (or other encodings) then you have a few options:

  • Convert to UTF-8. (A package like https://github.com/nalimilan/StringEncodings.jl can help here, although for the specific case of Windows-1250 you can probably make something more efficient if needed.)
  • Use String on the Windows-1250 text as-is. Non-ASCII characters will not be displayed correctly, but the underlying data will be preserved (in Julia 0.7 and later), and parsing HTML (which only looks at the ASCII characters, which are the same in Windows-1250 and UTF-8) should work. If you are just sending the non-ASCII data someplace else that expects Windows-1250, and you aren’t processing the non-ASCII text yourself, you can just pass it blindly through like this.
  • Define a new Windows1250String type that represents Windows-1250 data directly; this is a lot of work and is probably not worth it.
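For the first option, a minimal sketch with StringEncodings.jl might look like the following. It assumes the package is installed and that the underlying iconv library accepts the encoding name "WINDOWS-1250"; the helper name cp1250_to_string is made up for this example:

```julia
using StringEncodings   # third-party package, must be installed first

# Hypothetical helper: transcode raw Windows-1250 bytes into a UTF-8 String.
cp1250_to_string(bytes::Vector{UInt8}) = decode(bytes, "WINDOWS-1250")

# "życie" as Windows-1250 bytes (0xBF is 'ż')
word = cp1250_to_string(UInt8[0xbf, 0x79, 0x63, 0x69, 0x65])
```

In the code from the question, this would mean decoding the bytes returned by read(download(url)) before handing them to parsehtml, instead of wrapping them in String(...) directly.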

Generally, I would recommend just converting the data to UTF-8, especially if the conversion cost is not performance-critical.
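If you would rather avoid a dependency, a hand-rolled converter is also possible. The sketch below is deliberately incomplete: the table covers only the 18 Polish letters, and every other non-ASCII byte becomes U+FFFD. The names CP1250_POLISH and cp1250_to_utf8 are invented for this example, and it is not tuned for speed:

```julia
# Partial Windows-1250 table: only the Polish letters are mapped.
# A complete decoder would need all 128 entries for bytes 0x80-0xFF.
const CP1250_POLISH = Dict{UInt8,Char}(
    0xA5 => 'Ą', 0xB9 => 'ą',
    0xC6 => 'Ć', 0xE6 => 'ć',
    0xCA => 'Ę', 0xEA => 'ę',
    0xA3 => 'Ł', 0xB3 => 'ł',
    0xD1 => 'Ń', 0xF1 => 'ń',
    0xD3 => 'Ó', 0xF3 => 'ó',
    0x8C => 'Ś', 0x9C => 'ś',
    0x8F => 'Ź', 0x9F => 'ź',
    0xAF => 'Ż', 0xBF => 'ż',
)

function cp1250_to_utf8(bytes::AbstractVector{UInt8})
    io = IOBuffer()
    for b in bytes
        if b < 0x80
            write(io, b)   # ASCII bytes are identical in both encodings
        else
            # print writes the Char as UTF-8; unmapped bytes become U+FFFD
            print(io, get(CP1250_POLISH, b, '\ufffd'))
        end
    end
    return String(take!(io))
end

converted = cp1250_to_utf8(UInt8[0xbf, 0x79, 0x63, 0x69, 0x65])
```

For real pages a full mapping table (or the package) is the safer choice, since Windows-1250 also contains Czech, Slovak, Hungarian and other characters that this table ignores.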