The bug that #27273 fixes (for Windows) is really very old: it goes back to a Jan. 11, 2012 commit by @jeff.bezanson to j/libc.j (over a year before Julia v0.1 was released!).
The description
“The problem was invalid UTF-8 characters from strftime() depending on Windows system locales.”
is inaccurate: the characters are not “invalid”; they are completely valid EUC-KR (or CP949) encoded characters, as expected if you call strftime on any platform where the locale setting LC_TIME is not set to some *.UTF-8 locale.
The fix needs to be applied for all platforms (which makes it simpler anyway), and a similar fix needs to be made for strptime as well.
They may be valid EUC-KR characters, but that does not make them valid UTF-8. EUC-KR is not compatible with UTF-8 outside the ASCII range. It bothers people in this culture because EUC-KR differs from Unicode in both code points and byte encoding. For example, 한 is 0xc7d1 in EUC-KR, but U+D55C in Unicode, which is encoded as 0xed959c in UTF-8.
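To make the byte-level difference concrete, here is a small sketch using only Base functions. The EUC-KR byte pair is typed in by hand (Base has no EUC-KR decoder), so treat it as an assumption taken from the EUC-KR code table rather than something Julia computed.

```julia
# Code point and UTF-8 bytes of 한, as Julia sees them.
hangul = '한'
codepoint = UInt32(hangul)                        # Unicode scalar value
utf8_bytes = collect(codeunits(string(hangul)))   # UTF-8 code units

# Assumption: the EUC-KR encoding of 한, entered manually.
euckr_bytes = UInt8[0xc7, 0xd1]

# Wrapping the raw EUC-KR bytes in a String does not convert them;
# the result is simply not valid UTF-8.
raw = String(copy(euckr_bytes))

println("U+", uppercase(string(codepoint, base = 16)))
println(utf8_bytes)
println(isvalid(raw))
```

The bytes themselves are fine; they are just in an encoding that String does not expect.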
Currently Julia is not aware of any encodings other than Unicode, I guess (at least in base and stdlib). Python 3 only supports Unicode in str, yet it can use different encodings when encoding to or decoding from bytes. The strftime() function of the time module in Python 3 returns a correct str. I also want to use Unicode (especially UTF-8) as the main encoding in Julia. That requires that all bytes which are not compatible with Unicode be transcoded. I think it would be great if a package such as StringEncodings.jl became part of base or stdlib.
The fix needs to be applied for all platforms (which makes it simpler anyway), and a similar fix needs to be made for strptime as well.
Thank you! I should look for a wchar_t version of strptime(). If I understand correctly, do you mean the problem can also happen on other platforms, not only Windows?
My point was that calling them “invalid UTF-8” leads people to believe they were supposed to be UTF-8,
which was not the case at all. The function correctly returned them using EUC-KR encoding, based on the setting of LC_TIME, as noted in the documentation for the strftime and strptime functions on the different platforms. Note that both functions are part of the Open Group Unix standard and the POSIX standard, and strftime is also part of the ISO C standard, going back at least 30 years.
Almost every time over the years that I’ve seen people with “corrupted” text data, it has simply been a case of misidentification: the data was actually valid, it just wasn’t UTF-8.
Since String in Julia is supposed to be UTF-8, the strftime function needs to handle converting whatever the C library strftime or wcsftime function correctly returned, to the UTF-8 encoding required by the String type.
One solution to the problem is using wcsftime / wcsptime with transcode, however,
it might be better for Julia to always set its locale to a UTF-8 one (in your case, ko_KR.UTF-8).
That might cause problems for C code that Julia calls, depending on whether they use the locales correctly, but that’s probably not that likely.
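As a rough illustration of the wcsftime plus transcode idea, here is a minimal sketch. The helper name `wide_strftime` and the 256-element buffer are hypothetical choices for this example; it is not the code from #27273, and it assumes that `wcsftime` is resolvable from the C library Julia is linked against.

```julia
# Hypothetical helper: format a time via the wide-character C function,
# then convert the wchar_t result to a UTF-8 String.
function wide_strftime(fmt::AbstractString, tm::Libc.TmStruct)
    buf = Vector{Cwchar_t}(undef, 256)   # arbitrary buffer size for the sketch
    n = ccall(:wcsftime, Csize_t,
              (Ptr{Cwchar_t}, Csize_t, Cwstring, Ref{Libc.TmStruct}),
              buf, length(buf), fmt, tm)
    # wcsftime returns 0 when the result does not fit (or is empty).
    n == 0 && error("wcsftime returned 0")
    return transcode(String, buf[1:Int(n)])
end

# Jan. 11, 2012 (fields: sec, min, hour, mday, month, year, wday, yday, isdst).
tm = Libc.TmStruct(0, 0, 0, 11, 0, 112, 3, 10, 0)
println(wide_strftime("%Y-%m-%d", tm))
# Locale-dependent specifiers such as %A are where the encoding matters.
println(wide_strftime("%A", tm))
```

Because the wide result is UTF-16 or UTF-32 depending on the platform, no knowledge of the locale's byte encoding is needed on the Julia side.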
Yes, it’s just that most people on Macs or Linux have LC_TIME set to *.UTF-8.
Here is the output from locale on my laptop:
What is it on your system? (I hope that Windows has the locale command, if not, you might need to write a little C program to get the setting of LC_TIME).
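On POSIX-like systems, a quick way to see which setting would apply without writing a C program is to look at the environment variables from Julia. This is only a sketch: `effective_lc_time` and `uses_utf8` are hypothetical helpers, they follow the POSIX precedence LC_ALL, then LC_TIME, then LANG, and they only reflect the environment, not what the C runtime has actually selected (Windows in particular does not use these variables natively).

```julia
# Hypothetical helper: which locale name would LC_TIME resolve to,
# going by environment variables alone (POSIX precedence)?
function effective_lc_time(env = ENV)
    for name in ("LC_ALL", "LC_TIME", "LANG")
        value = get(env, name, "")
        # An empty value counts as unset.
        isempty(value) || return value
    end
    return "C"   # default when nothing is set
end

# Does the locale name carry a UTF-8 codeset suffix (".UTF-8" or ".utf8")?
uses_utf8(locale::AbstractString) = occursin(r"\.utf-?8"i, locale)

current = effective_lc_time()
println(current, " => UTF-8? ", uses_utf8(current))
```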
Here is an example of the bug on the Mac (with %A instead of %Z, because for some reason on the Mac, the time zone always comes out with the ASCII abbreviation):
Right. Instead of saying strftime returns invalid UTF-8, I should say strftime returns an EUC-KR string, which Julia does not know how to handle. Most other languages provide functions for converting between Unicode (mostly UTF-8) and the system encoding. In my opinion, Julia also needs to support those features.
The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page value of UTF-7 or UTF-8, setlocale will fail, returning NULL.
This is why setlocale() fails with UTF-7 or UTF-8 on Windows.
Yes, sometimes UTF-8 consumes more memory and takes more time to index, especially when I use Hangul (Korean letters), which all take 2 bytes in UTF-16 but are encoded as 3 bytes in UTF-8.
I think it might be possible to let the runtime implicitly choose a suitable encoding for a given String. For example, if the text is ASCII only, Julia could choose UTF-8 to reduce memory usage, and if it contains lots of Hangul characters, Julia could switch to UTF-16. But I’m not sure; this could add overhead of its own.
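The size trade-off can be checked with Base alone. The helper name `utf_sizes` below is made up for this sketch; it simply compares the byte counts of the UTF-8 and UTF-16 encodings of the same text.

```julia
# Hypothetical helper: byte size of a string in UTF-8 and in UTF-16.
function utf_sizes(s::AbstractString)
    str = String(s)
    utf8 = sizeof(str)                              # bytes in UTF-8
    utf16 = 2 * length(transcode(UInt16, str))      # 2 bytes per UTF-16 code unit
    return (utf8 = utf8, utf16 = utf16)
end

println(utf_sizes("한글"))    # Hangul: smaller in UTF-16
println(utf_sizes("Julia"))   # ASCII: smaller in UTF-8
```

For Hangul-heavy text UTF-16 needs two thirds of the space, while for ASCII it needs twice as much, which is the tension behind choosing a representation per string.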
Yes, definitely. It’s part of what I’d planned to add when I get around to implementing a StrEncodings.jl package.
It’s interesting, that in Julia, you are getting the CP949 codepage, but Cygwin changes it to the Korean UTF-8 locale, for better Unix compatibility.
What happens if you start Julia from Cygwin?
Or if you set the environment variable in Cygwin? export LC_TIME="en-EN.UTF-8" for example?
Please check out JuliaString.org , especially the Strs package.
(Ignore all the red badges, those are because of the changes a few days ago to switch from Pkg → Pkg3 on master, which I’m still trying to adapt to. Everything works fine on v0.6.2).
It has a UniStr (Union) type, which selects between the ASCIIStr, _LatinStr, _UCS2Str, or _UTF32Str types to get the most efficient representation (in space / performance). It’s frequently many times faster than the base String type.
If you like working on string handling, internationalization issues, etc, maybe you’d be interested in contributing!
Some very fine people have already joined the organization
To make myself more clear: I feel that the bugs in strftime and strptime come from a lack of awareness of how character set encodings are used around the world, stemming from this UTF-8 only or centric view.
Also, many other bugs that I’ve seen over the last 3 years in Julia itself, or in many of the packages that deal with strings (such as JSON, CSV, all of the database wrappers) come from either 1) lack of identification of the character set / encoding (such as in the strftime/strptime case), assuming everything would be UTF-8, or 2) issues caused by the complexity of dealing with multi-codeunit encodings such as UTF-8, such as indexing into the middle of UTF-8 sequences, incorrectly specifying the end of a range of characters [lastindex vs. sizeof, for example], etc.
I agree that having a single recommended string type (but not necessarily a single internal string representation!) for most use, especially with the high numbers of Julians who are researchers, professors, mathematicians, scientists of all types but not so many CS types, is a good thing, but it should be something that is easy for them to use, which is why I’ve been working on a UniStr type that does not have the issues that complex encodings such as UTF-8 have.
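The second class of bugs is easy to reproduce with the base String type. The following sketch (variable names are only for illustration) shows the lastindex vs. sizeof confusion and what happens when an index lands in the middle of a UTF-8 sequence.

```julia
s = "한글"   # two characters, three UTF-8 bytes each

nbytes = sizeof(s)       # number of bytes (code units)
last = lastindex(s)      # byte index where the last character starts
nchars = length(s)       # number of characters

# Index 2 is inside the first character, so it is not a valid index.
mid_ok = isvalid(s, 2)
second = nextind(s, 1)   # index of the second character

whole = s[1:lastindex(s)]   # correct way to take the whole string

# Using sizeof as the end of a range points into the middle of 글.
err = try
    s[1:sizeof(s)]
    nothing
catch e
    e
end

println((nbytes, last, nchars, mid_ok, second))
println(whole)
println(typeof(err))
```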
It also needs to be able to handle in an easy fashion converting back and forth to other encodings such as UTF-8, UTF-16, Cwstring (i.e. either UTF-16 or UTF-32 depending on platform), taking care of any system conversions (such as with strftime) to/from the system’s character set / encoding for that function / locale (i.e. LC_TIME).
I think the fact that nobody on the GitHub discussion recognized that it was an issue with not respecting locale settings, and not some Windows specific problem, illustrates my point.
If I hadn’t spoken up by starting this post on Discourse, it likely would have only been fixed for Windows, and only strftime and not strptime.
Bugs usually happen for reasons, whether it be typos, off-by-one issues, confusable names, and for strings it’s often encoding / indexing issues, and in this case it seems, a lack of awareness of locale related issues.
Just sweeping it under the rug by saying “it’s just a bug” means that people won’t go and check for other places where there might also be bugs due to similar issues.
(I note that there was one earlier with BigFloat formatting discussed recently, with locales that swap . and , from the American meaning).
Finally, this is in no way meant as denigrating other people or their work, not that many people are aware of all the issues with dealing with locales, national character sets, encodings, conversions, collation, security, etc.
(just as I am not aware of all of the issues behind the heated discussions of dot vs. inner, adjoint vs. transpose),
and also many people simply aren’t that interested in digging into them (they have enough on their own plates with all the great stuff going on in the Julia ecosystem, like all the compiler optimizations, or the new packaging system, for example!).
We all have different areas of expertise, this just happens to be one of mine.
I hope you recall that I’ve always acknowledged the brilliance of all of the Julians I’ve met,
(and you’ve managed to hire quite some of the cream of the crop at JC!), so please stop taking any discussion of bugs, different approaches to string handling, etc. as being “swipes” or “attacks”.
Yes, it’s a bug, and in my opinion the main reason is that Julia cannot handle different encodings inside base or stdlib.
It may not be a problem if I can use other packages such as StringEncodings.jl, but this happens inside of Julia where I can only use base and stdlib.
I think there are two possible solutions.
1. Use the wchar_t versions of the functions.
2. Make Julia able to convert from/to different encodings.
Luckily, we have wchar_t versions of strftime and strptime: wcsftime and wcsptime. If there were none, there would be no way to deal with this except by using external packages. As I mentioned, the problem is that we cannot use different encodings inside Julia base/stdlib.
The second solution would also cover any similar encoding problems in base and stdlib, especially where there is no wchar_t version. It would be possible if Julia used GNU iconv or ICU.
I’m working on using wcsftime and wcsptime, but we should still consider adding support for different encodings in base or stdlib.
julia> r = ccall(:wcsptime, Cwstring, (Cwstring, Cwstring, Ref{TmStruct}), timestr, fmt, tm)
ERROR: ccall: could not find function wcsptime
Stacktrace:
[1] top-level scope at .\<missing>:0
I thought wchar.h was part of the standard C or C++ library, but maybe it is not included in the Julia build.
If Julia cannot include wchar.h, we need the second solution.
If it’s a function from the standard C library, it will be in libc.so (or libc.dll) somewhere. That was my thinking.
You’re right, wcsptime is not part of the standard library: it is not in time.h, nor in wchar.h. I cannot find it in the GNU libc reference or on MSDN, though wcsftime exists in time.h.
Edit:
I should distinguish between the ISO C standard and the POSIX standard, because I cannot find strptime on MSDN.
Please have a look at ISO C Standard and POSIX Standard.
strftime and wcsftime are in ISO C <time.h>.
strptime is not in ISO C, but in POSIX <time.h>. From the latest GNU libc manual:
21.4.6 Convert textual time and date information back
The ISO C standard does not specify any functions which can convert the output of the
strftime function back into a binary format. This led to a variety of more-or-less successful
implementations with different interfaces over the years. Then the Unix standard was
extended by the addition of two functions: strptime and getdate. Both have strange
interfaces but at least they are widely available.
wcsptime is in neither ISO C nor POSIX.
Windows is not POSIX compliant. Cygwin provides a POSIX layer, but it is not a binary-compatible layer: we have to recompile programs from source code. When I compile a program for Cygwin, it sometimes requires cygwin1.dll, which makes it unable to run without Cygwin.
Anyway, Julia can call strptime on Windows:
julia> using Libdl
julia> for dl in Libdl.dllist()
           if Libdl.dlsym_e(Libdl.dlopen_e(dl), :strptime) != C_NULL
               println(dl)
           end
       end
C:\Users\alkorang\AppData\Local\Julia-0.7.0-alpha\bin\libjulia.dll
julia>