Sorting strings by Unicode Collation order?

digital_carver · May 28, 2018, 6:48am

Is there a way to sort strings by the Unicode-specified collation order in Julia? Currently, a plain sort seems to sort just by codepoint order, e.g.:


julia> current_order = sort(["மனம்", "மயம்", "மகம்", "மறம்", "மலம்"])
5-element Array{String,1}:
 "மகம்"
 "மனம்"
 "மயம்"
 "மறம்"
 "மலம்"

julia> expected_order = ["மகம்", "மயம்", "மலம்", "மறம்", "மனம்"]
5-element Array{String,1}:
 "மகம்"
 "மயம்"
 "மலம்"
 "மறம்"
 "மனம்"

julia> current_order .== expected_order
5-element BitArray{1}:
  true
 false
 false
  true
 false

digital_carver · May 28, 2018, 12:12pm

I looked a bit into what a few other major languages do regarding Collation.

Python apparently leaves it to external libraries (e.g. pyuca), as does Ruby (ICU, TwitterCLDR, etc.).

Perl’s sort uses the collation algorithm when use locale is in effect with a UTF-8 locale.

Java does collation with the Collator class, with the locale specified in the constructor.

C# has Globalization.SortKey that uses the collation algorithm depending on the CurrentCulture (locale) setting.

It makes sense for the default sort to do it the simple, efficient, and locale-independent way (the current comparator seems to boil down to c = ccall(:memcmp, Int32, (Ptr{UInt8}, Ptr{UInt8}, UInt), a, b, min(al,bl)) in cmp(::String, ::String)). But it would be useful to have the option of using a sort that uses a collation algorithm, perhaps as an overload in the stdlib Unicode module.
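As a rough illustration of what a non-codepoint ordering could look like today, here is a minimal sketch that sorts by a hand-written rank table passed through sort's by keyword. This is not the Unicode Collation Algorithm: the table (TAMIL_CONSONANTS) and the helper (tamil_key) are hypothetical names made up for this example, they cover only the 18 Tamil consonants in their traditional order, and they ignore vowels, vowel signs, and multi-level weights.

```julia
# Hypothetical rank table: the 18 Tamil consonants in traditional alphabet order.
# (Assumption for illustration only; a real collation needs far more than this.)
const TAMIL_CONSONANTS = ['க', 'ங', 'ச', 'ஞ', 'ட', 'ண', 'த', 'ந', 'ப',
                          'ம', 'ய', 'ர', 'ல', 'வ', 'ழ', 'ள', 'ற', 'ன']
const RANK = Dict(c => i for (i, c) in enumerate(TAMIL_CONSONANTS))

# Map a string to a vector of ranks. Characters outside the table are placed
# after all table entries and ordered among themselves by codepoint.
tamil_key(s::AbstractString) = [get(RANK, c, length(RANK) + Int(c)) for c in s]

words = ["மனம்", "மயம்", "மகம்", "மறம்", "மலம்"]

# Vectors of integers compare lexicographically, so they work as sort keys.
sorted = sort(words, by = tamil_key)
```

With this key, sorted comes out in the expected_order from the first post. The approach does not scale past toy cases, which is the argument for a proper collation option.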

ScottPJones · May 28, 2018, 12:13pm

There are all sorts of collation orders. What in particular were you expecting?
Generally, anything more complex than sorting normalized strings by the Unicode codepoints needs to be locale specific (French in particular has very complicated rules for what they consider correct for dictionaries and phone books).
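To make "sorting normalized strings by the Unicode codepoints" concrete, here is a small sketch using Unicode.normalize from the stdlib Unicode module (Julia 0.7 and later; Julia 0.6 spelled this normalize_string). The sample words are made up for illustration. The order is still locale-independent codepoint order, only computed on a canonical form.

```julia
using Unicode

# "é" spelled two ways: precomposed (U+00E9) and "e" + combining acute (U+0301).
words = ["\u00e9a", "e\u0301b", "f"]

# Raw comparison sees different codepoints for the two spellings of "é",
# so "f" ends up between them.
raw = sort(words)

# Comparing NFC-normalized keys treats both spellings identically,
# so the two "é" words become adjacent. The original strings are returned.
normalized = sort(words, by = s -> Unicode.normalize(s, :NFC))
```

Note that under NFC "f" still sorts before both "é" words, because U+00E9 is a higher codepoint than "f". Getting "é" to sort next to "e" is exactly the part that needs a real collation.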

ScottPJones · May 30, 2018, 5:20pm

Collation is a very complex task, with many different collations used even within a single locale.
Probably the best option currently available in Julia is ICU or StrICU (as soon as that’s registered, hopefully soon). As you noted, other languages use the ICU library; while large, it’s pretty much the gold standard for dealing with complex Unicode issues.
