Tangent on radix sort

By “skewed” and non-“balanced” data, are you referring to the case where the bits that discriminate elements occur at the locations closer to the least significant bits? If so, yes, it’s possible to invoke this case in the current implementation of ParallelMSDRadixSort

julia> x = collect(reinterpret(Float64, rand(UInt64.(0:128), 100_000_000)));

julia> @time sort(x, alg = RadixSort);
  1.583404 seconds (8 allocations: 1.490 GiB, 11.20% gc time)

julia> @time sort(x, alg = ThreadsX.QuickSort);
  0.890255 seconds (1.31 M allocations: 1.690 GiB, 5.72% gc time)

julia> @time sort(x, alg = ParallelMSDRadixSort);
  2.374892 seconds (3.65 k allocations: 1.490 GiB, 5.22% gc time)

julia> Threads.nthreads()
8

(Using weirdly generated floats since ThreadsX has parallel counting sort which is much better for integers.)

But I think the flip side of the coin is that MSD is useful for collections of variable-length elements like Vector{String} or Vector{Vector{Int}}. I think it’s also a common kind of distribution since you can have, e.g., a table column with strings where the first few characters of 100 character strings discriminate the elements. I think another upside of MSD over LSD may be that its access pattern is more friendly to nested parallelism that Julia has.

Also, ParallelRadixSort is still not very clever. For example, I think it’d be interesting to fuse the calculation of the maximum element in the first pass of the counting. It’d tell us the most significant discriminating bit position and so it’s possible to skip a few chunks of MSDs with this approach. Or, maybe we can use a quicksort-like approach as the first pass so that it’d narrow down the ranges that subsequence MSD radix sort has to process. I think it’s also straightforward to transform ParallelRadixSort algorithm to multi-pivot quicksort whose performance may be more robust in narrow distributions like the one I used above.

2 Likes

Exactly. the results are as predicted. I’d say skewed data is more common is everyday usage.

I’m not sure if that’s accurate. For floats, even if all exponent bits are identical (i.e., “narrow” distribution), the few first passes can already scan significant digits. That’s why I needed to create a very contrived data set using reinterpret.

I agree that real integer data sets tend to have a very narrow range. But I don’t think it’s relevant since that’s exactly where we should be using counting sort anyway.

Also, as I discussed in the OP, there are various ways to fix it and I think it’s challenging to optimize LSD sort for variable-sized data like strings. That said, I think it’d be interesting to implement LSD sort on GPU.

1 Like