Sorting strings by Unicode Collation order?


#1

Is there a way to sort strings by the Unicode-specified collation order in Julia? Currently, a plain sort seems to sort just by codepoint order, for example:


julia> current_order = sort(["மனம்", "மயம்", "மகம்", "மறம்", "மலம்"])
5-element Array{String,1}:
 "மகம்"
 "மனம்"
 "மயம்"
 "மறம்"
 "மலம்"

julia> expected_order = ["மகம்", "மயம்", "மலம்", "மறம்", "மனம்"]
5-element Array{String,1}:
 "மகம்"
 "மயம்"
 "மலம்"
 "மறம்"
 "மனம்"

julia> current_order .== expected_order
5-element BitArray{1}:
  true
 false
 false
  true
 false


#2

I looked a bit into what a few other major languages do regarding Collation.

Python apparently leaves it to external libraries (e.g. pyuca), and so does Ruby (ICU, TwitterCLDR, etc.).

Perl’s sort uses the collation algorithm when use locale is in effect with a UTF-8 locale.

Java does collation with the Collator class, with the locale specified in the constructor.

C# has Globalization.SortKey that uses the collation algorithm depending on the CurrentCulture (locale) setting.

It makes sense for the default sort to do it the simple, efficient, and locale-independent way (the current comparator seems to boil down to c = ccall(:memcmp, Int32, (Ptr{UInt8}, Ptr{UInt8}, UInt), a, b, min(al,bl)) in cmp(::String, ::String)). But it would be useful to have the option of using a sort that uses a collation algorithm, perhaps as an overload in the stdlib Unicode module.
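To make the idea concrete, here is a minimal sketch of sorting with a custom key through sort's by keyword. It is not the Unicode Collation Algorithm: it only ranks bare Tamil consonants in the traditional alphabet order, and every other character (vowel signs, the pulli, non-Tamil text) falls back to its codepoint, placed after all consonants. The names TAMIL_CONSONANTS, TAMIL_RANK and tamil_key are made up for illustration, and the consonant table is my own simplified assumption (Grantha letters are left out).

```julia
# Assumed, simplified traditional order of the 18 Tamil consonants
# (Grantha letters omitted). Not taken from any library.
const TAMIL_CONSONANTS = collect("கஙசஞடணதநபமயரலவழளறன")

# Hypothetical lookup table: consonant => position in the order above.
const TAMIL_RANK = Dict(c => i for (i, c) in enumerate(TAMIL_CONSONANTS))

# Sort key: one integer per character. Characters outside the table
# fall back to their codepoint, offset so they sort after all consonants.
tamil_key(s::AbstractString) =
    [get(TAMIL_RANK, c, length(TAMIL_CONSONANTS) + Int(c)) for c in s]

words = ["மனம்", "மயம்", "மகம்", "மறம்", "மலம்"]

# Vectors of integers compare lexicographically, so `by` is enough here.
sorted = sort(words, by = tamil_key)
# sorted == ["மகம்", "மயம்", "மலம்", "மறம்", "மனம்"]
```

This gives the expected order for the example in the first post, but a real collation has to deal with multiple comparison levels, contractions, and locale tailoring, which a per-character rank table cannot express.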


#3

There are all sorts of collation orders; which one in particular were you expecting?
Generally, anything more complex than sorting normalized strings by their Unicode codepoints needs to be locale-specific (French in particular has very complicated rules for what is considered correct in dictionaries and phone books).
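
For reference, the locale-independent baseline mentioned above (sorting normalized strings by codepoint) can be sketched with Unicode.normalize from the stdlib Unicode module. The two sample strings are made up for illustration; they start with the same visible letter, written once as a base letter plus combining accent and once as a single precomposed codepoint.

```julia
using Unicode

decomposed  = "e\u0301z"   # 'e' followed by a combining acute accent, then 'z'
precomposed = "\u00e9a"    # single codepoint U+00E9, then 'a'
words = [precomposed, decomposed]

# Raw codepoint order: the decomposed string comes first because 'e' < U+00E9.
raw = sort(words)

# Compare NFC-normalized forms instead; both strings now start with U+00E9,
# so the second character ('a' vs 'z') decides the order.
by_nfc = sort(words, by = s -> Unicode.normalize(s, :NFC))
```

Here raw puts the decomposed string first, while by_nfc puts the precomposed one first. Normalization only makes equivalent spellings compare consistently; it does not produce a language-aware order.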


#4

Collation is a very complex task, with many different collations used even within a single locale.
Probably the best option currently available in Julia is to use ICU, or StrICU as soon as that’s registered (hopefully soon). As you noted, other languages use the ICU library; while large, it’s pretty much the gold standard for dealing with complex Unicode issues.