Request for comments: a roadmap to faster sort() for primitive types using RadixSort

I believe the original misunderstanding was mine. When you said “if a pass has only one element to sort” I thought you meant every element was identical, but I now think you meant that every element falls into the same bucket. Because of the compression passes, I think we avoid these wasteful passes much sooner and don’t even perform a count on them. The only case I can think of where this would come up is sorting something like [101000000, 11000000, 011000000] or [10100000000001, 11100000000010, 100000000101], where a chunk of bits is identical across all elements. I doubt this case is common enough to justify adding the check, but if someone can provide a real-world dataset with a measurable performance impact, I’d have to think harder about it.
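For concreteness, if such a check ever did prove worthwhile, one way to detect a constant chunk is to compare the OR and AND reductions of the masked chunks; this is only a sketch of the idea, not something I’ve benchmarked, and the function name is mine:

    # Sketch only: detect a pass where every element falls into the same
    # bucket. If the AND and OR of the masked chunks agree, all chunks are
    # identical and the pass could be skipped without counting.
    function constant_chunk(vs, shift, mask)
        # assumes vs is nonempty
        lo = hi = (vs[1] >> shift) & mask
        for x in vs
            c = (x >> shift) & mask
            lo &= c
            hi |= c
        end
        lo == hi
    end

Note that the check itself costs a full pass over the data, which is part of why I doubt it would pay for itself in general.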

I’ve implemented LSD radix (i.e. sort first by the least significant chunk, then by the second least significant, then the third, and so on) so that we need only O(log(n)) function calls rather than O(n). The first pass sorts the entire array as one; each subsequent pass does the same but looks at more significant bits.

# Assumes: vs holds the serialized values, ts is a scratch buffer of the
# same length, bits is the total bit-width being sorted, and BinType is an
# integer type wide enough to count length(vs) elements.
counts = Vector{BinType}(undef, 2^chunk_size + 1)
mask = 2^chunk_size - 1

for shift in 0:chunk_size:bits-1

    counts .= zero(BinType)

    # Counting pass: tally each bucket, shifted up one slot so that the
    # prefix sum below yields starting offsets directly.
    for x in vs
        idx = ((x >> shift) & mask) + 2
        counts[idx] += one(BinType)
    end

    # Prefix sum: afterwards counts[b + 1] holds the number of elements in
    # buckets strictly below b, i.e. bucket b's 0-based starting offset.
    total = counts[1]
    for i in 2:2^chunk_size
        total += counts[i]
        counts[i] = total
    end

    # Placement pass: copy each element into its bucket in the scratch
    # buffer, advancing that bucket's cursor as we go.
    for x in vs
        i = ((x >> shift) & mask) + 1
        j = counts[i] += 1
        ts[j] = x
        #sortperm here
    end

    # ts now holds the data sorted by this chunk; swap roles for the next,
    # more significant, pass.
    vs, ts = ts, vs

end
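For context, here is how the names above might be bound; these bindings are my illustration, not code from the OP:

    # Illustrative setup (inside a function; the swap rebinds locals only):
    vs = rand(UInt64, 1_000_000)  # serialized input
    ts = similar(vs)              # scratch buffer, same length
    BinType = UInt32              # wide enough to count length(vs) elements
    chunk_size = 8                # bits consumed per pass
    bits = 64                     # total bits to sort on
    # ...the loop above runs here...

Because the buffers are swapped after every pass, the local vs binding refers to the most recently written buffer when the loop exits; with an even number of passes, that is the original input array.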

Aside from loop control, it happens to be entirely branch-free, but I doubt that makes a substantial difference: the branches associated with special-case detection would also be easy for a branch predictor to handle, and where mispredictions did occur, the special case would presumably pay for them.

This is precisely what I specify in the docstrings of my serialization methods (see OP).

No. I do not define any types for serialization (see the OP for the implementation if you’d like, though I will later have a heavily commented implementation for a PR that will be more readable).

Yes. I serialize Vector{Int32} to Vector{UInt32} and Vector{Float64} to Vector{UInt64}.
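For anyone unfamiliar with the trick, these are the standard order-preserving bit transformations; a sketch with my own naming, not the actual method names from the OP:

    # Sketch of the standard order-preserving mappings (names are mine).
    # Int32 -> UInt32: flipping the sign bit maps the signed order onto
    # the unsigned order.
    serialize(x::Int32) = reinterpret(UInt32, x) ⊻ 0x8000_0000

    # Float64 -> UInt64: positive floats get their sign bit flipped;
    # negative floats are flipped entirely, which also reverses their order.
    function serialize(x::Float64)
        u = reinterpret(UInt64, x)
        u ⊻ (signbit(x) ? typemax(UInt64) : 0x8000_0000_0000_0000)
    end

Applying the map, radix sorting the unsigned values, and then inverting the map yields the original values in ascending order (NaNs land at the two extremes depending on their sign bit).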

This is a valid objection, and indeed both of these passes are unnecessary: the first serialization can be fused into a serialize-and-radix pass, and the deserialization can likewise be a deserialize-and-radix pass. Given the marked performance improvement of the system I have now over what is currently in Julia/Base, I have decided to focus on simplicity and readability for the moment, but I imagine removing these extra array accesses at some point.
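To make the fusion concrete, the first pass (shift == 0) could apply the mapping inline; this is a sketch reusing the hypothetical serialize from above, not the planned implementation:

    # Sketch: fuse serialization into the first radix pass (shift == 0),
    # so the separate serialization sweep disappears.
    for x in raw                       # raw: the unserialized input
        counts[(serialize(x) & mask) + 2] += one(BinType)
    end
    # ...prefix sum as before...
    for x in raw
        s = serialize(x)
        j = counts[(s & mask) + 1] += 1
        ts[j] = s                      # later passes see serialized values
    end

The final pass can symmetrically apply the inverse map as it writes elements out.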

A slightly more accurate runtime analysis is N/n + WEIGHT * 2^n, which points to the Lambert W function rather than log2 for the optimal chunk size. I currently minimize N/n + 2^n, but I plan to perform extensive benchmarking and potentially reconsider.
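For what it’s worth, the current choice amounts to a brute-force minimization; something like this sketch, where the search bound of 16 is an arbitrary choice of mine and argmin(f, itr) requires Julia ≥ 1.7:

    # Sketch of the current heuristic: pick the chunk size n minimizing
    # N/n + 2^n (no WEIGHT term yet). The bound 16 is an illustration.
    best_chunk_size(N; max_chunk = 16) = argmin(n -> N / n + 2^n, 1:max_chunk)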

LSD radix avoids this entirely.

Which places the approximation at strictly more comparisons than getting the right answer the first time. Do you think that the middle 90% of many datasets can be represented in substantially lower bit-width than the entire dataset?

As I’ve said before, this is not for variable-width sorting (including string sorting).

You have alluded multiple times to a “way that you do it”; do you happen to have a legible implementation I could read? Ideally one I could benchmark, or one implemented in Julia? If you don’t have an implementation but do have a full algorithm in mind, it might help clarify what you mean if you described the whole idea.

A lot of the questions you raise would be easily answered with benchmarking, and I am currently working on a sorting benchmark package (this portion of the project is on hold until that benchmark and test suite is ready to go). Hopefully I can make the implementation legible enough for you to enjoy reading; then you can make your proposals concretely, and you or I can test whether they result in a runtime improvement!