Is there a way to sort strings by the Unicode-specified collation order in Julia? Currently, a plain sort
seems to sort just by codepoint order, e.g.:
julia> current_order = sort(["மனம்", "மயம்", "மகம்", "மறம்", "மலம்"])
5-element Array{String,1}:
"மகம்"
"மனம்"
"மயம்"
"மறம்"
"மலம்"
julia> expected_order = ["மகம்", "மயம்", "மலம்", "மறம்", "மனம்"]
5-element Array{String,1}:
"மகம்"
"மயம்"
"மலம்"
"மறம்"
"மனம்"
julia> current_order .== expected_order
5-element BitArray{1}:
true
false
false
true
false
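To make the gap concrete, here is a toy sketch that reproduces the expected order with a hand-written sort key. It is not the Unicode Collation Algorithm: the names `tamil_consonants`, `rank`, and `toy_sortkey` are hypothetical, and the table only encodes the traditional alphabetical order of the base Tamil consonants as an assumption for this example.

```julia
# Toy illustration only: NOT the Unicode Collation Algorithm.
# Tamil consonants in traditional alphabetical order (hand-written assumption).
tamil_consonants = collect("கஙசஞடணதநபமயரலவழளறன")
rank = Dict(c => i for (i, c) in enumerate(tamil_consonants))

# Map each character to its alphabet rank. Anything not in the table
# (vowel signs, pulli, other scripts) falls back to its codepoint, shifted
# past the table so it sorts after the listed consonants.
toy_sortkey(s::AbstractString) =
    [get(rank, c, Int(c) + length(tamil_consonants)) for c in s]

words = ["மனம்", "மயம்", "மகம்", "மறம்", "மலம்"]

# Vectors of integers compare lexicographically, so `by` is enough here.
toy_sorted = sort(words, by = toy_sortkey)
```

A real collation would also have to handle vowels, vowel signs, multi-level comparison, and locale tailoring, which is exactly what I would like library support for.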
I looked a bit into what a few other major languages do regarding Collation.
Python apparently leaves it to external libraries (e.g. pyuca), and so does Ruby (ICU, TwitterCLDR, etc.).
Perl's `sort` uses the collation algorithm when `use locale` is in effect with a UTF-8 locale.
Java does collation with the `Collator` class, with the locale specified in the constructor.
C# has `Globalization.SortKey`, which uses the collation algorithm according to the `CurrentCulture` (locale) setting.
It makes sense for the default `sort` to do it the simple, efficient, and locale-independent way (the current comparator seems to boil down to `c = ccall(:memcmp, Int32, (Ptr{UInt8}, Ptr{UInt8}, UInt), a, b, min(al,bl))` in `cmp(::String, ::String)`). But it would be useful to have the option of using a sort that uses a collation algorithm, perhaps as an overload in the stdlib `Unicode` module.
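As a quick check of that reading of the default comparator, the following sketch compares the default order against two explicit keys: the raw UTF-8 code units and the sequence of codepoints. The variable names are only illustrative; for valid UTF-8 strings I would expect all three orders to coincide, since byte-wise order of UTF-8 preserves codepoint order.

```julia
words = ["மனம்", "மயம்", "மகம்", "மறம்", "மலம்"]

# Default string ordering.
default_sorted = sort(words)

# Compare the raw UTF-8 code units (bytes), which is what memcmp looks at.
by_bytes = sort(words, by = codeunits)

# Compare the sequences of codepoints instead.
by_codepoints = sort(words, by = s -> [UInt32(c) for c in s])

# True when all three orderings agree.
all_agree = default_sorted == by_bytes == by_codepoints
```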
There are all sorts of collation orders; which one in particular were you expecting?
Generally, anything more complex than sorting normalized strings by Unicode codepoint needs to be locale-specific (French in particular has very complicated rules for what is considered correct in dictionaries and phone books).
Collation is a very complex task, with many different collations used even within a single locale.
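For the simple baseline mentioned above, sorting normalized strings by codepoint, a minimal sketch using `Unicode.normalize` from the stdlib might look like this. The sample strings are made up for illustration; this only removes differences between equivalent encodings of the same text and does nothing locale-specific.

```julia
using Unicode

# "é" written two ways: precomposed (U+00E9) and as "e" + combining acute (U+0301).
precomposed = "\u00e9a"
decomposed  = "e\u0301b"

# Raw order compares bytes, so the decomposed form (starting with plain "e")
# sorts first even though its second letter is "b".
raw_sorted = sort([precomposed, decomposed])

# Normalizing to NFC in the sort key makes both spellings of "é" compare equal,
# so the following letter ("a" vs "b") decides the order.
nfc_sorted = sort([precomposed, decomposed], by = s -> Unicode.normalize(s, :NFC))
```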
Probably the best thing currently available in Julia is to use `ICU` or `StrICU` (as soon as that's registered, hopefully soon). As you noted, other languages use the ICU library; while large, it's pretty much the gold standard for dealing with complex Unicode issues.