Request for comments: a roadmap to faster sort() for primitive types using RadixSort

I believe the original misunderstanding was mine. When you said “if a pass has only one element to sort” I thought you meant every element was identical, but I now think you meant that every element falls into the same bucket. Because of the compression passes, I think we avoid these wasteful passes much sooner and don’t even perform a count on them. The only case I can think of where this would come up is sorting something like [101000000, 11000000, 011000000] or [10100000000001, 11100000000010, 100000000101], where a chunk of bits is identical across all elements. I doubt this case is common enough to justify adding the check, but if someone can provide a real-world dataset with a measurable performance impact, I’d have to think harder about it.
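For concreteness, if such a check ever did prove worthwhile, one way to detect a constant chunk is to compare the OR and AND reductions of the masked chunks; this is only a sketch of the idea, not something I’ve benchmarked, and the function name is mine:

    # Sketch only: detect a pass where every element falls into the same
    # bucket. If the AND and OR of the masked chunks agree, all chunks are
    # identical and the pass could be skipped without counting.
    function constant_chunk(vs, shift, mask)
        # assumes vs is nonempty
        lo = hi = (vs[1] >> shift) & mask
        for x in vs
            c = (x >> shift) & mask
            lo &= c
            hi |= c
        end
        lo == hi
    end

Note that the check itself costs a full pass over the data, which is part of why I doubt it would pay for itself in general.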

I’ve implemented LSD radix (i.e. sort first by the least significant chunk, then by the second least significant, then the third, and so on) so that we need only O(log(n)) function calls rather than O(n). The first pass sorts the entire array as one; each subsequent pass does the same but looks at more significant bits.

# Assumes: vs holds the serialized values, ts is a scratch buffer of the
# same length, bits is the total bit-width being sorted, and BinType is an
# integer type wide enough to count length(vs) elements.
counts = Vector{BinType}(undef, 2^chunk_size + 1)
mask = 2^chunk_size - 1

for shift in 0:chunk_size:bits-1

    counts .= zero(BinType)

    # Counting pass: tally each bucket, shifted up one slot so that the
    # prefix sum below yields starting offsets directly.
    for x in vs
        idx = ((x >> shift) & mask) + 2
        counts[idx] += one(BinType)
    end

    # Prefix sum: afterwards counts[b + 1] holds the number of elements in
    # buckets strictly below b, i.e. bucket b's 0-based starting offset.
    total = counts[1]
    for i in 2:2^chunk_size
        total += counts[i]
        counts[i] = total
    end

    # Placement pass: copy each element into its bucket in the scratch
    # buffer, advancing that bucket's cursor as we go.
    for x in vs
        i = ((x >> shift) & mask) + 1
        j = counts[i] += 1
        ts[j] = x
        #sortperm here
    end

    # ts now holds the data sorted by this chunk; swap roles for the next,
    # more significant, pass.
    vs, ts = ts, vs

end
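For context, here is how the names above might be bound; these bindings are my illustration, not code from the OP:

    # Illustrative setup (inside a function; the swap rebinds locals only):
    vs = rand(UInt64, 1_000_000)  # serialized input
    ts = similar(vs)              # scratch buffer, same length
    BinType = UInt32              # wide enough to count length(vs) elements
    chunk_size = 8                # bits consumed per pass
    bits = 64                     # total bits to sort on
    # ...the loop above runs here...

Because the buffers are swapped after every pass, the local vs binding refers to the most recently written buffer when the loop exits; with an even number of passes, that is the original input array.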

Aside from loop control, it happens to be entirely branch-free, but I doubt that makes a substantial difference: the branches associated with special-case detection would also be easy for a branch predictor to handle, and where mispredictions did occur, the special case would presumably pay for them.

This is precisely what I specify in the docstrings of my serialization methods (see OP).

No. I do not define any types for serialization (see the OP for the implementation if you’d like, though I will later have a heavily commented implementation for a PR that will be more readable).

Yes. I serialize Vector{Int32} to Vector{UInt32} and Vector{Float64} to Vector{UInt64}.
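For anyone unfamiliar with the trick, these are the standard order-preserving bit transformations; a sketch with my own naming, not the actual method names from the OP:

    # Sketch of the standard order-preserving mappings (names are mine).
    # Int32 -> UInt32: flipping the sign bit maps the signed order onto
    # the unsigned order.
    serialize(x::Int32) = reinterpret(UInt32, x) ⊻ 0x8000_0000

    # Float64 -> UInt64: positive floats get their sign bit flipped;
    # negative floats are flipped entirely, which also reverses their order.
    function serialize(x::Float64)
        u = reinterpret(UInt64, x)
        u ⊻ (signbit(x) ? typemax(UInt64) : 0x8000_0000_0000_0000)
    end

Applying the map, radix sorting the unsigned values, and then inverting the map yields the original values in ascending order (NaNs land at the two extremes depending on their sign bit).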

This is a valid objection, and indeed both of these passes are unnecessary: the first serialization can be fused into a serialize-and-radix pass, and the deserialization can likewise be a deserialize-and-radix pass. Given the marked performance improvement of the system I have now over what is currently in Julia/Base, I have decided to focus on simplicity and readability for the moment, but I imagine removing these extra array accesses at some point.
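To make the fusion concrete, the first pass (shift == 0) could apply the mapping inline; this is a sketch reusing the hypothetical serialize from above, not the planned implementation:

    # Sketch: fuse serialization into the first radix pass (shift == 0),
    # so the separate serialization sweep disappears.
    for x in raw                       # raw: the unserialized input
        counts[(serialize(x) & mask) + 2] += one(BinType)
    end
    # ...prefix sum as before...
    for x in raw
        s = serialize(x)
        j = counts[(s & mask) + 1] += 1
        ts[j] = s                      # later passes see serialized values
    end

The final pass can symmetrically apply the inverse map as it writes elements out.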

A slightly more accurate runtime analysis is N/n + WEIGHT * 2^n, which points to the Lambert W function rather than log2 for the optimal chunk size. I currently minimize N/n + 2^n, but I plan to perform extensive benchmarking and potentially reconsider.
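For what it’s worth, the current choice amounts to a brute-force minimization; something like this sketch, where the search bound of 16 is an arbitrary choice of mine and argmin(f, itr) requires Julia ≥ 1.7:

    # Sketch of the current heuristic: pick the chunk size n minimizing
    # N/n + 2^n (no WEIGHT term yet). The bound 16 is an illustration.
    best_chunk_size(N; max_chunk = 16) = argmin(n -> N / n + 2^n, 1:max_chunk)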

LSD radix avoids this entirely.

Which places the approximation at strictly more comparisons than getting the right answer the first time. Do you think that the middle 90% of many datasets can be represented in substantially lower bit-width than the entire dataset?

As I’ve said before, this is not for variable-width sorting (including string sorting).

You have alluded multiple times to a “way that you do it”; do you happen to have a legible implementation I could read? Ideally one I could benchmark, or one implemented in Julia? If you don’t have an implementation but do have a full algorithm in mind, it might help clarify what you mean if you described the whole idea.

A lot of the questions you raise would be easily answered with benchmarking, and I am currently working on a sorting benchmark package (this portion of the project is on hold until that benchmark and test suite is ready to go). Hopefully I can make the implementation legible enough for you to enjoy reading; then you can make your proposals concretely, and you or I can test whether they result in a runtime improvement!