By “skewed” and non-“balanced” data, are you referring to the case where the bits that discriminate elements occur at the locations closer to the least significant bits? If so, yes, it’s possible to invoke this case in the current implementation of `ParallelMSDRadixSort`

```
julia> x = collect(reinterpret(Float64, rand(UInt64.(0:128), 100_000_000)));
julia> @time sort(x, alg = RadixSort);
1.583404 seconds (8 allocations: 1.490 GiB, 11.20% gc time)
julia> @time sort(x, alg = ThreadsX.QuickSort);
0.890255 seconds (1.31 M allocations: 1.690 GiB, 5.72% gc time)
julia> @time sort(x, alg = ParallelMSDRadixSort);
2.374892 seconds (3.65 k allocations: 1.490 GiB, 5.22% gc time)
julia> Threads.nthreads()
8
```

(Using weirdly generated floats since ThreadsX has parallel counting sort which is much better for integers.)

But I think the flip side of the coin is that MSD is useful for collections of variable-length elements like `Vector{String}`

or `Vector{Vector{Int}}`

. I think it’s also a common kind of distribution since you can have, e.g., a table column with strings where the first few characters of 100 character strings discriminate the elements. I think another upside of MSD over LSD may be that its access pattern is more friendly to nested parallelism that Julia has.

Also, `ParallelRadixSort`

is still not very clever. For example, I think it’d be interesting to fuse the calculation of the maximum element in the first pass of the counting. It’d tell us the most significant discriminating bit position and so it’s possible to skip a few chunks of MSDs with this approach. Or, maybe we can use a quicksort-like approach as the first pass so that it’d narrow down the ranges that subsequence MSD radix sort has to process. I think it’s also straightforward to transform `ParallelRadixSort`

algorithm to multi-pivot quicksort whose performance may be more robust in narrow distributions like the one I used above.