Convert string encoding to UTF-8

That’s the Latin-1 encoding of "café", not Unicode. You need to convert this to the UTF-8 encoding of Unicode as used by Julia. I’m guessing that you are on Windows and that this is actually Windows-1252, since Latin-1 is not common anymore elsewhere.

Two options:

  1. Read it into Julia as bytes (Vector{UInt8} via read(io)) and convert the encoding with StringEncodings.jl or a similar package. For example, decode(Vector{UInt8}("caf\xe9"), "Windows-1252") gives "café".

  2. Change your files to use UTF-8. Windows-1252 is an archaic single-byte encoding that can only represent 256 characters, nowhere near all of Unicode. People should really stop using it. See here for various tools. (For a single file, you can just open it in Notepad or some other editor and choose “UTF-8” when you save, but there are also batch tools to re-encode many files at once.)
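Here is a rough sketch of what both options amount to, using only Base so it runs without installing anything. Note the assumptions: latin1_to_utf8 and reencode_latin1_file are hypothetical helper names made up for this example, and the conversion handles true Latin-1 (ISO-8859-1) only, where each byte maps to the Unicode code point of the same value. Windows-1252 differs from Latin-1 for bytes 0x80 through 0x9F, so for real Windows-1252 data the decode call from option 1 is the safer route.

```julia
# Hypothetical helper (not part of Base or StringEncodings.jl):
# decode ISO-8859-1 (Latin-1) bytes into a UTF-8 encoded String.
# Assumption: the input is Latin-1, where byte b is code point U+00b.
# This is NOT a full Windows-1252 decoder (bytes 0x80-0x9F differ).
latin1_to_utf8(bytes::AbstractVector{UInt8}) = join(Char(b) for b in bytes)

# Hypothetical helper for option 2: rewrite one Latin-1 file as UTF-8.
# `src` and `dst` are file paths chosen by the caller.
function reencode_latin1_file(src::AbstractString, dst::AbstractString)
    raw = read(src)                  # read(path) returns a Vector{UInt8}
    write(dst, latin1_to_utf8(raw))  # Julia writes Strings as UTF-8 bytes
    return dst
end

raw = Vector{UInt8}("caf\xe9")       # the bytes 0x63 0x61 0x66 0xe9
println(isvalid(String(copy(raw))))  # is the raw data valid UTF-8?
fixed = latin1_to_utf8(raw)
println(fixed)
println(isvalid(fixed))              # is the converted string valid UTF-8?
```

Checking isvalid on the string before and after is a quick way to tell whether data you read is already UTF-8 or still needs converting.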

(In any case, this is not technically about “parsing”, which is a distinct concept from “encoding”.)
