Before going to sleep I tried these two ideas, expecting nothing but…
String sorting performance vs R is only marginally (within 1.x-1.3x) slower now! In my benchmarks sorting 10m records where 100k of them are distinct yields 1.31secs (Julia) vs 1.1secs (R) (for 100m it’s around 25-27 seconds vs 18 -21 seconds).

I made these 3 changes
- Use my own sort function instead of
Base.sortwhich is hard to overload asBase.sortdoesn’t dispatch onalg, or more generally on named arguments. Some howBase.sort(svec, alg = StringRadixSort)is running some code which probably wasn’t all that necessary. - Change the
SortingAlgorithms.RADIXSIZEto 16 from 11. This means sort 16 bits in one go instead of 11 bits. After getting comfortable with radixsort, I was curious whySortingAlgorithms.jlchose 11 as theRADIXSIZE, the most common types are probably multiples of 16 bits, e.g. UInt32 is 32 bits and UInt64 is 64 bits. Sorting 11 bits in one go, we need 3 passes over the data to sort 32 bit type and 6 passes over the data for a 64 bit type vs 2 and 4 passes if we just choose aRADIXSIZEof 16 bits. From benchmarking, increasing theRADIXSIZEresults in faster sorting too. - Do not load underlying bits into
UInt128and instead useUInt64as the maximum. I found that it’s quicker to sort twice usingUInt64instead of sorting onceUInt128. When I said sort “once” the data has been gone through 4 times when sorting the underlying bytes as 64bit; so really the number of passes over the data is the same whether you useUInt128orUInt64but it seems Julia can handle 64bit data better which makes sense given that I am running on a 64bit machine.