Hi all,
Somewhere in my code I read strings which contain Unicode characters.
For example, what I read in looks like this:
myString = "caf\xe9"
I’d like this string to be parsed as
myString = "café"
I played around with StringEncodings.jl but with no success. Any ideas how I can do that?
Thanks
That’s the Latin-1 encoding of "café", not Unicode. You need to convert this to the UTF-8 encoding of Unicode as used by Julia. I’m guessing that you are on Windows and that this is actually Windows-1252, since Latin-1 is not common anymore elsewhere.
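To make the difference concrete, here is a small sketch using only Base Julia (the variable names are just for illustration). It shows that the single byte 0xe9 is not valid UTF-8 on its own, whereas the UTF-8 form of "é" takes two bytes:

```julia
# The string as read from the file: 'é' stored as the single byte 0xe9
latin1 = "caf\xe9"
# The same text as a normal Julia (UTF-8) string literal
utf8 = "café"

# 0xe9 without continuation bytes is not a valid UTF-8 sequence
println(isvalid(latin1))    # false
println(isvalid(utf8))      # true

# In UTF-8, 'é' is encoded as the two bytes 0xc3 0xa9
println(collect(codeunits(utf8)))
```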
Two options:
- Read it into Julia as bytes (Vector{UInt8} via read(io)) and convert the encoding with StringEncodings.jl or some similar package. For example, decode(Vector{UInt8}("caf\xe9"), "Windows-1252") gives "café".
- Change your files to use UTF-8. Windows-1252 is an archaic single-byte encoding that can encode at most 256 characters, nowhere near all of Unicode. People should really stop using it. See here for various tools. (For a single file, you can just open it in Notepad or some other editor and choose “UTF-8” when you save, but there are also batch tools to re-encode many files at once.)
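The first option might look roughly like the sketch below when the data comes from a file. The temporary file and its contents are made up for the example; it assumes StringEncodings.jl is installed and that the file really is Windows-1252:

```julia
using StringEncodings

# Hypothetical sample file holding the Windows-1252 bytes for "café"
path = tempname()
write(path, UInt8[0x63, 0x61, 0x66, 0xe9])

# Read the raw bytes; no decoding happens at this point
bytes = read(path)

# Convert from Windows-1252 to a regular UTF-8 Julia String
text = decode(bytes, "Windows-1252")
println(text)
```

Reading raw bytes first avoids ever holding an invalid UTF-8 string in a String variable.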
(In any case, this is not technically about “parsing”, which is a distinct concept from “encoding”.)
Indeed, the encoding is “windows-1252”. I can’t change the files’ encoding to UTF-8, as they are produced by an external system which I don’t have access to.
If anyone has the same question, this works like a charm:
using StringEncodings
my_string = "caf\xe9"
my_string = decode(Vector{UInt8}(my_string), "windows-1252")
Thanks!
(I also edited the post title)