Convert string encoding to UTF-8

Thomas_Lei · March 23, 2021, 12:54pm

Hi all,

Somewhere in my code I read strings that contain Unicode characters.
For example, the reading output is:
myString = "caf\xe9"

I’d like this string to be parsed as
myString = "café"

I played around with StringEncodings.jl but with no success. Any ideas how I can do that?

Thanks

stevengj · March 23, 2021, 1:15pm

That’s the Latin-1 encoding of "café", not Unicode. You need to convert this to the UTF-8 encoding of Unicode as used by Julia. I’m guessing that you are on Windows and that this is actually Windows-1252, since Latin-1 is not common anymore elsewhere.
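To see the difference concretely, here is a small sketch using only Base Julia (the variable names are just for illustration). It compares the raw bytes of the string as read with the bytes of the UTF-8 string you want:

```julia
# The string as read: 4 bytes, where the lone byte 0xe9 is "é" in Latin-1 / Windows-1252
latin1 = "caf\xe9"
# The string you want: 5 bytes, because UTF-8 encodes "é" as the two bytes 0xc3 0xa9
utf8 = "café"

# isvalid reports whether a String holds well-formed UTF-8
latin1_ok = isvalid(latin1)   # false: 0xe9 followed by nothing is not valid UTF-8
utf8_ok = isvalid(utf8)       # true

# Inspect the underlying bytes of each string
latin1_bytes = collect(codeunits(latin1))   # UInt8[0x63, 0x61, 0x66, 0xe9]
utf8_bytes = collect(codeunits(utf8))       # UInt8[0x63, 0x61, 0x66, 0xc3, 0xa9]
```

So the data is not "wrong", it is just encoded differently from what Julia's String type expects.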

Two options:

1. Read it into Julia as bytes (Vector{UInt8} via read(io)) and convert the encoding with StringEncodings.jl or some similar package. e.g. decode(Vector{UInt8}("caf\xe9"), "Windows-1252") gives "café".
2. Change your files to use UTF-8. Windows-1252 is an archaic encoding that can only encode 256 characters, nowhere near all of Unicode. People should really stop using it. See here for various tools. (e.g. For a single file, you can just open it in Notepad or some other editor and choose “UTF-8” when you save, but there are also batch tools to re-encode many files at once.)

(In any case, this is not technically about “parsing”, which is a distinct concept from “encoding”.)
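A minimal sketch of the first option, assuming StringEncodings.jl is installed. The temporary file is only a stand-in for whatever file you are actually reading, and the variable names are illustrative:

```julia
using StringEncodings  # third-party package; install with `] add StringEncodings`

# Stand-in for the real input: a file holding the Windows-1252 bytes of "café"
path, io = mktemp()
write(io, UInt8[0x63, 0x61, 0x66, 0xe9])
close(io)

# read(path) returns the raw bytes (Vector{UInt8}) without assuming any encoding
raw = read(path)

# decode converts the bytes from the named encoding into a UTF-8 Julia String
text = decode(raw, "windows-1252")
```

Reading as bytes first avoids ever creating an invalid String; the data only becomes a String once the source encoding has been stated explicitly.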

Thomas_Lei · March 23, 2021, 4:54pm

Indeed, the encoding is “windows-1252”. I can’t change the files’ encoding to UTF-8 as they are produced by an external system which I don’t have access to…

If anyone has the same question, this works like a charm:

using StringEncodings
my_string = "caf\xe9"
my_string = decode(Vector{UInt8}(my_string), "windows-1252")

Thanks !
(I also edited the post title)
