Just got a project with quite a few iso8859-1 encoded files. Is there a concise way to readstring
a given file in Julia that is not utf-8?
UPDATED: forgot about StringEncodings.jl
You might want to roll your own for this, as converting ANSI Latin-1 (ISO 8859-1) to UTF-8 is pretty trivial.
Basically, any byte < 0x80 is output unchanged, bytes between 0x80 and 0xbf are output as 0xc2 followed by the byte, and bytes between 0xc0 and 0xff are output as 0xc3 followed by the byte minus 0x40.
From my experience, it’s a lot faster to first scan the input for any bytes > 0x7f (and count them as you do so). If the count is 0, there are only ASCII characters, so you can make the bytes directly into a String. Otherwise, you can use the count to allocate the output vector with the exact length (for v0.6 and later, you can use Base.StringVector(n)).
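For anyone who wants to see the shape of it, here is a minimal sketch of the two-pass approach described above. It is not the benchmarked function from this thread; the name latin1_to_utf8 is made up for illustration, and it assumes a Julia version where Base.StringVector (an unexported internal) is available.

```julia
# Hypothetical helper, not from any package: converts ISO 8859-1 bytes to a UTF-8 String.
function latin1_to_utf8(bytes::Vector{UInt8})
    # Pass 1: count the bytes that will need a two-byte UTF-8 sequence.
    n_high = count(b -> b > 0x7f, bytes)

    # Pure ASCII is already valid UTF-8, so wrap the bytes directly.
    # Note: String(bytes) takes ownership of the vector and leaves it empty.
    n_high == 0 && return String(bytes)

    # Pass 2: the exact output length is known, so allocate once.
    out = Base.StringVector(length(bytes) + n_high)
    pos = 0
    @inbounds for b in bytes
        if b < 0x80
            out[pos += 1] = b          # ASCII: unchanged
        elseif b < 0xc0
            out[pos += 1] = 0xc2       # 0x80..0xbf: 0xc2 followed by the byte
            out[pos += 1] = b
        else
            out[pos += 1] = 0xc3       # 0xc0..0xff: 0xc3 followed by byte - 0x40
            out[pos += 1] = b - 0x40
        end
    end
    return String(out)
end
```

With that defined, something like latin1_to_utf8(read(path)) would read and convert a file, where path is whatever file you have; read(path) returns the raw bytes as a Vector{UInt8}.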
Thanks! Never worked much explicitly with this encoding issue. All files have now been converted.
Ah, good. If this were something where you were constantly getting new files, and the performance mattered, then I’d recommend rolling your own as above. (I’m also thinking of adding a native Julia string encoding/decoding package to the JuliaString repo, for cases where using iconv (as StringEncodings does) or ICU is too slow.)
I went ahead and benchmarked a little ANSI Latin-1 to UTF-8 conversion function (about 18 lines of Julia code) against StringEncodings. When converting a string that had some characters > 0x7f, it was about 45x faster, with 1/4.5x the memory allocated. When converting all-ASCII input, it was 230x faster, with 0 allocations (the vector that was read in from the file was simply converted to a String directly), compared to StringEncodings allocating 3.23MB (4.5x the size of the string being converted!).
I think I need to get off my bum and get this into a package!
I think one of the issues here isn’t just the speed, it’s discovering a lot of functionality buried away in packages. I am a great fan of Julia and have done most of my work over the last 3+ years in the language, but I can imagine a new user coming along and getting somewhat frustrated trying to do simple things like reading a file encoded differently from what Base assumes. This isn’t just a Julia problem, but perhaps there is some solution that might cut the search time.
Regarding your enhanced conversion function, perhaps this is something that needs to be done within the StringEncodings package.
Actually, many people (myself included) would like to have much of what is currently in base Julia moved out into either packages or optionally loaded pre-compiled modules. (Unlike many other languages, Julia includes what would be one or more standard libraries in Base, which slows down loading / building / testing when you don’t need everything.)
I’ve contributed in the past to the StringEncodings package, and that might be a place for it, but I do have some rather different ideas for what I’d like to accomplish, and I really want it to be owned by an organization (like the JuliaString one that I’m trying to get off the ground), with multiple people who can review/merge/etc.
I fully agree with a minimal/standard base. Perhaps it is still too early, but I can see a standard set of packages developing that users could easily reference for most of their basic needs. Maybe it’s the org thing, maybe it’s a committee, maybe it’s just a massive index or centralized doc thingy… I am sure this is already being hotly debated somewhere.
Another 3 years on, and your comment is still relevant.