Convert string encoding to UTF-8

Hi all,

Somewhere in my code I read strings that contain Unicode characters.
For example, the output of the read is:
myString = "caf\xe9"

I’d like this string to be parsed as
myString = "café"

I played around with StringEncodings.jl but with no success. Any ideas how I can do that?

Thanks

That’s the Latin-1 encoding of "café", not Unicode. You need to convert this to the UTF-8 encoding of Unicode as used by Julia. I’m guessing that you are on Windows and that this is actually Windows-1252, since Latin-1 is not common anymore elsewhere.
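To make the distinction concrete, here is a minimal stdlib-only sketch. It assumes the data really is Latin-1 (ISO-8859-1), where each byte value equals the Unicode code point of its character. The helper name latin1_to_utf8 is made up for illustration. It is not correct for Windows-1252, which differs from Latin-1 in the 0x80 to 0x9F range, so use a real conversion package for that.

```julia
# Stdlib-only sketch: inspect the raw bytes and decode them as Latin-1.
s = "caf\xe9"            # 0xE9 is 'é' in Latin-1, but not valid UTF-8 on its own

bytes = codeunits(s)     # the four raw bytes: 0x63 0x61 0x66 0xe9
@show isvalid(s)         # Julia keeps the bytes, but the string is not valid UTF-8

# In Latin-1 every byte value is also the Unicode code point of the character,
# so Char(byte) does the conversion. (latin1_to_utf8 is a hypothetical helper.)
latin1_to_utf8(bytes) = join(Char(b) for b in bytes)

fixed = latin1_to_utf8(bytes)
@show fixed
@show isvalid(fixed)
```

After the conversion, 'é' is stored as two UTF-8 bytes (0xC3 0xA9), so the string grows from four code units to five.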

Two options:

  1. Read it into Julia as bytes (Vector{UInt8} via read(io)) and convert the encoding with StringEncodings.jl or a similar package. For example, decode(Vector{UInt8}("caf\xe9"), "Windows-1252") gives "café".

  2. Change your files to use UTF-8. Windows-1252 is an archaic single-byte encoding that can represent at most 256 characters, nowhere near all of Unicode. People should really stop using it. See here for various tools. (For a single file, you can just open it in Notepad or some other editor and choose “UTF-8” when you save, but there are also batch tools to re-encode many files at once.)
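As a rough sketch of option 1, the snippet below reads raw bytes from an IO stream and decodes them with StringEncodings.jl. It assumes the package is installed, and it uses an in-memory IOBuffer as a stand-in for the real file or stream, which is not shown in this thread.

```julia
using StringEncodings

# Stand-in for a file produced by the external system: "café" in Windows-1252.
io = IOBuffer(UInt8[0x63, 0x61, 0x66, 0xe9])

raw = read(io)                        # Vector{UInt8}, no encoding assumed yet
text = decode(raw, "windows-1252")    # convert to a normal UTF-8 Julia String

@show text
@show isvalid(text)
```

With a real file, open(path) would take the place of the IOBuffer. The key point is to decode the bytes before treating them as a String.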

(In any case, this is not technically about “parsing”, which is a distinct concept from “encoding”.)


Indeed, the encoding is “windows-1252”. I can’t change the files’ encoding to UTF-8, as they are produced by an external system which I don’t have access to…

If anyone has the same question, this works like a charm:

using StringEncodings
my_string = "caf\xe9"
my_string = decode(Vector{UInt8}(my_string), "windows-1252")

Thanks!
(I also edited the post title)
