String encodings help


#1

I’m looking for help determining which character sets are compatible with UTF-8 and can therefore be safely converted from a Vector{UInt8} with the String constructor. I’m not an expert in character encodings and not sure how to determine that accurately. Any help or pointers to references would be much appreciated!

These are the encodings identifiable from a SAS data file using my SASLib.jl package. I believe that US-ASCII, ISO-8859-1 and WINDOWS-1252 are OK.
What about those variations of ISO-8859-x and WINDOWS-125x?

"BIG-5"
"CP932"
"EUC-JP"
"EUC-KR"
"EUC-TW"
"GB18030"
"ISO-8859-1"
"ISO-8859-11"
"ISO-8859-2"
"ISO-8859-3"
"ISO-8859-6"
"ISO-8859-7"
"ISO-8859-8"
"ISO-8859-9"
"US-ASCII"
"WINDOWS-1250"
"WINDOWS-1251"
"WINDOWS-1252"
"WINDOWS-1253"
"WINDOWS-1254"
"WINDOWS-1255"
"WINDOWS-1256"
"WINDOWS-1257"


#2

I’m mistaken… anything other than US-ASCII would have to be converted to UTF-8 somehow. Just found a great post from before.
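To illustrate why only US-ASCII is safe, here is a minimal sketch using only Base functions. The sample bytes and the helper name `is_ascii_bytes` are made up for illustration. `String(bytes)` does not transcode anything; it just treats the bytes as UTF-8.

```julia
# "café" encoded as ISO-8859-1: 'é' is the single byte 0xE9.
latin1_bytes = UInt8[0x63, 0x61, 0x66, 0xe9]

# String(::Vector{UInt8}) takes ownership of the vector, so pass a copy.
s = String(copy(latin1_bytes))

# A lone 0xE9 is a UTF-8 lead byte with no continuation bytes,
# so the resulting string is not well-formed UTF-8.
latin1_is_valid = isvalid(s)          # false

# Hypothetical helper: pure-ASCII data needs no conversion at all.
is_ascii_bytes(bytes) = all(<(0x80), bytes)

ascii_bytes = UInt8[0x63, 0x61, 0x66, 0x65]   # "cafe"
ascii_ok = is_ascii_bytes(ascii_bytes)         # true
```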

@ScottPJones, is your neat conversion utility already in a package or something that I can reuse?


#3

I must have been blind not to catch this.

Great work, Scott!


#4

I think StringEncodings.jl is currently the only Julia package for converting non-Unicode encodings to UTF-8. Scott’s Strs.jl package mentions that he is working on a StrEncodings.jl package for conversion also, but as far as I can tell nothing has been posted yet. @nalimilan seems to have kept StringEncodings.jl up-to-date, however, and it is based on the robust and well-tested iconv library; I see no reason not to use it.


#5

If they are really ISO-8859-1, it’s actually very easy: read in the bytes, make a vector of Chars out of them, and then make a String out of that. That works because ISO-8859-1 is a pure 8-bit subset of Unicode (just as ASCII is a pure 7-bit subset). With the Strs.jl package, you can simply read them in as LatinStr directly.
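A minimal sketch of the bytes-to-Chars-to-String route in plain Base Julia. The function name `latin1_to_string` is hypothetical, and it assumes the input really is ISO-8859-1.

```julia
# Hypothetical helper: decode ISO-8859-1 bytes into a regular (UTF-8) String.
# Every Latin-1 byte value equals its Unicode code point, so Char(b) decodes it.
function latin1_to_string(bytes::AbstractVector{UInt8})
    chars = map(Char, bytes)   # Vector{Char}, one Char per input byte
    return String(chars)       # re-encoded as UTF-8
end

# "café" in ISO-8859-1 is 4 bytes; the UTF-8 result uses 2 bytes for 'é'.
decoded = latin1_to_string(UInt8[0x63, 0x61, 0x66, 0xe9])
```

Bytes of 0x80 and above each become a two-byte UTF-8 sequence, so the output can be up to twice the size of the input.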

The problem comes in if it’s not really 8859-1, but rather Microsoft’s CP-1252, which is almost the same as 8859-1, but adds a bunch of printable characters (such as the Euro sign € at 0x80). Then you’d need a (simple) mapping table for the code points between 0x80 and 0x9f.
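The mapping table could look roughly like the sketch below. The names `CP1252_HIGH`, `cp1252_char`, and `cp1252_to_string` are hypothetical. Five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are unassigned in Windows-1252; passing them through as the same-numbered C1 control characters is a choice made for this example, and raising an error would be equally defensible.

```julia
# Code points for bytes 0x80 through 0x9F in Windows-1252.
# Unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are kept as C1 controls (assumption).
const CP1252_HIGH = Char[
    '\u20ac', '\u0081', '\u201a', '\u0192', '\u201e', '\u2026', '\u2020', '\u2021',  # 0x80-0x87
    '\u02c6', '\u2030', '\u0160', '\u2039', '\u0152', '\u008d', '\u017d', '\u008f',  # 0x88-0x8f
    '\u0090', '\u2018', '\u2019', '\u201c', '\u201d', '\u2022', '\u2013', '\u2014',  # 0x90-0x97
    '\u02dc', '\u2122', '\u0161', '\u203a', '\u0153', '\u009d', '\u017e', '\u0178',  # 0x98-0x9f
]

# Outside 0x80-0x9F, Windows-1252 agrees with ISO-8859-1, so Char(b) is enough.
cp1252_char(b::UInt8) = 0x80 <= b <= 0x9f ? CP1252_HIGH[Int(b) - 0x7f] : Char(b)

# Hypothetical helper: decode Windows-1252 bytes into a UTF-8 String.
cp1252_to_string(bytes::AbstractVector{UInt8}) = String(map(cp1252_char, bytes))

euro_price = cp1252_to_string(UInt8[0x80, 0x31, 0x30])   # Euro sign followed by "10"
```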

It may be a few weeks before I get conversions added. I want to do that in pure Julia, in a way that lets tables be loaded on the fly. I also want to supply some utility functions to inspect the mappings that something like iconv or ICU provides, and to build a compressed table that my StrEncodings can use later.
If you want to get a feel for how that would work, please take a look at what I did to read and process things like the Unicode data file, as well as other sources for HTML entities, LaTeX entities, and Emojis, in some of the other packages I did in the JuliaString organization.


#6

It’s still very much a WIP and needs a lot more testing, benchmarking, and optimization.
Hopefully shortly it will be something people can make use of.


#7

Depending on the problem, it may also be viable to run iconv or recode directly on the file(s) once, obtain UTF-8, and forget about the whole issue afterwards. This is what I usually do; the last non-UTF-8 files I saw were from 10+ years ago.


#8

I already depend on StringEncodings.jl, but I am trying to optimize performance by avoiding it when there’s a faster path. As Scott pointed out, certain conversions are quite simple to implement, so it would be a nice win.
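A rough sketch of what such a fast path might look like. The function `fast_decode` and its `fallback` argument are hypothetical; in practice the fallback could be a general-purpose converter with a `(bytes, encoding)` signature. The sketch assumes that every encoding in the list above maps bytes below 0x80 to ASCII, which I believe holds but is worth verifying per encoding.

```julia
# Hypothetical dispatcher: try cheap paths first, then defer to a general converter.
# `fallback` is any function taking (bytes, encoding) and returning a String.
function fast_decode(bytes::Vector{UInt8}, encoding::AbstractString, fallback)
    if all(<(0x80), bytes)
        # Pure ASCII is already valid UTF-8 (assumes an ASCII-compatible encoding).
        return String(copy(bytes))
    elseif encoding == "ISO-8859-1"
        # Latin-1 byte values are Unicode code points.
        return String(map(Char, bytes))
    else
        return fallback(bytes, encoding)
    end
end

# Stand-in fallback for demonstration only; it just reports that it was called.
demo_fallback(bytes, encoding) = "fallback:" * encoding

r1 = fast_decode(UInt8[0x61, 0x62], "WINDOWS-1251", demo_fallback)   # ASCII path
r2 = fast_decode(UInt8[0x61, 0xe9], "ISO-8859-1", demo_fallback)     # Latin-1 path
r3 = fast_decode(UInt8[0x61, 0xe9], "WINDOWS-1251", demo_fallback)   # fallback path
```

The ASCII check is a single pass over the bytes, so it costs little even when the data ends up going to the fallback.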