Readstring encoding

merl-dev · June 13, 2017, 3:27pm

Just got a project with quite a few ISO 8859-1 encoded files. Is there a concise way in Julia to readstring a given file that is not UTF-8?

UPDATED: forgot about StringEncodings.jl

ScottPJones · June 13, 2017, 5:24pm

You might want to roll your own for this, as converting ANSI Latin-1 (ISO 8859-1) to UTF-8 is pretty trivial.
Basically, any byte < 0x80 is output unchanged, bytes between 0x80 and 0xbf are output as 0xc2 followed by the byte, and bytes between 0xc0 and 0xff are output as 0xc3 followed by the byte - 0x40.
From my experience, it’s a lot faster to first scan the input for any bytes > 0x7f (and count them as you do so).
If the count is 0, you can make the bytes directly into a String, since there are only ASCII characters.
Otherwise, you can use the count to allocate the output vector with the exact length (for v0.6 and later, you can use Base.StringVector(n)).
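
The two-pass approach described above can be sketched roughly as follows. This is a minimal illustration written against Julia 1.x, not the function that was benchmarked later in the thread. The names latin1_to_utf8 and read_latin1 are hypothetical, and a plain Vector{UInt8}(undef, n) is used for the output buffer in place of Base.StringVector(n).

```julia
# Hypothetical helper, not part of Base or any package:
# convert ISO 8859-1 (Latin-1) bytes to a UTF-8 encoded String.
function latin1_to_utf8(bytes::Vector{UInt8})
    # First pass: count the bytes that expand to two bytes in UTF-8.
    nhigh = count(b -> b > 0x7f, bytes)

    # Only ASCII: the bytes are already valid UTF-8.
    # String(bytes) would avoid this copy, but it takes ownership of
    # (and empties) the input vector, so the sketch copies to be safe.
    nhigh == 0 && return String(copy(bytes))

    # Second pass: fill an output buffer of exactly the right length.
    out = Vector{UInt8}(undef, length(bytes) + nhigh)
    i = 0
    for b in bytes
        if b < 0x80
            i += 1
            out[i] = b              # ASCII, unchanged
        elseif b < 0xc0
            i += 1
            out[i] = 0xc2           # lead byte for U+0080..U+00BF
            i += 1
            out[i] = b
        else
            i += 1
            out[i] = 0xc3           # lead byte for U+00C0..U+00FF
            i += 1
            out[i] = b - 0x40
        end
    end
    return String(out)
end

# Hypothetical convenience wrapper: read a Latin-1 file as a String.
read_latin1(path::AbstractString) = latin1_to_utf8(read(path))
```

For example, latin1_to_utf8(UInt8[0x63, 0x61, 0x66, 0xe9]) yields "café", since 0xe9 falls in the 0xc0 to 0xff range and becomes 0xc3 0xa9.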

merl-dev · June 13, 2017, 6:27pm

Thanks! Never worked much explicitly with this encoding issue. All files have now been converted.

ScottPJones · June 13, 2017, 7:06pm

Ah, good. If this were something where you were constantly getting new files, and the performance mattered, then I’d recommend rolling your own as above. (I’m thinking of adding a native Julia string encoding/decoding package to the JuliaString repo also, for cases where using iconv (as StringEncodings does) or ICU is too slow.)

ScottPJones · June 13, 2017, 7:53pm

I went ahead and benchmarked a little ANSI Latin-1 to UTF-8 conversion function (about 18 lines of Julia code) against StringEncodings. It was about 45x faster, with 1/4.5x the memory allocated, when converting a string that had some characters > 0x7f. When converting a string with only ASCII characters, it was 230x faster, with 0 allocations (the vector that was read in from the file was simply converted to a String directly), compared to StringEncodings allocating 3.23MB (4.5x the size of the data being converted!).
I think I need to get off my bum and get this into a package!

merl-dev · June 13, 2017, 8:20pm

I think one of the issues here isn’t just the speed, it’s discovering a lot of functionality buried away in packages. I am a great fan of Julia and have done most of my work over the last 3+ years in the language, but I can imagine a new user coming along and getting somewhat frustrated trying to do simple things like reading a file encoded differently from what Base assumes. This isn’t just a Julia problem, but perhaps there is some solution that might cut the search time.
Regarding your enhanced conversion function, perhaps this is something that needs to be done within the StringEncodings package.

ScottPJones · June 13, 2017, 9:20pm

Actually, many people (myself included) would like to have much that is currently in base Julia moved out into either packages or optionally loaded pre-compiled modules. (Unlike many other languages, Julia includes what would be one or more standard libraries in Base, which slows down loading / building / testing when you don’t need everything.)
I’ve contributed in the past to the StringEncodings package, so that might be a place for it, but I do have some rather different ideas for what I’d like to accomplish, and I really want it to be owned by an organization (like the JuliaString that I’m trying to get off the ground), with multiple people who can review/merge/etc.

merl-dev · June 13, 2017, 11:28pm

I fully agree with a minimal/standard base. Perhaps it is still too early, but I can see a standard set of packages developing that users could easily reference for most of their basic needs. Maybe it’s the org thing, maybe it’s a committee, maybe it’s just a massive index, centralized doc thingy… I am sure this is already being hotly debated somewhere.

victorromeo · November 16, 2020, 10:55pm

Another 3 years on, and your comment is still relevant.
