Request for comments: a roadmap to faster sort() for primitive types using RadixSort

I have a fast radix sort. Perhaps 1.2–2x faster than SortingAlgorithms’s radix sort according to my very limited benchmarks. SorthingAlgorithms’ is also quite a bit faster than Base’s current defaults, putting the total speedup around 2–7x, again, according to very limited benchmarks.

This only works for a small subset of types

Specifically, mine works only for AbstractVector{UIntN}, and with a thin, two pass pre- and one pass post- processor also works for AbstractVector{IntN}, AbstractVector{FloatN}.

My plan is to implement a trait serializable(::Type)* and functions serialize()/deserialize() that convert things uniquely to and back from an Unsigned serialization that preserves isless, and on sort!(v::AbstractVector{T}), check if serializiable(T), and if so, dispatch to a serialized radix sort which works as follows:

  1. First pass serializing each element in the input vector (serialize(x) is guaranteed to return a serialization that fits anywhere an x could fit (i.e. is smaller in memory)) & at the same time calculating the maximum and minimum serialization (these are both Unsigned).
  2. Report the minimum and maximum to a heuristic which
    a) Decides whether to dispatch to counting sort
    b) Counts how many bits are in max and in max-min
    c) Decides whether it is worth it to conduct a second pass (see 3)
    d) Decides how big chunk_size should be
  3. Conduct a second compression pass
    a) Maybe subtract out the minimum (an optimization for the common case of sorting Signed which also helps with sorting arrays of positive, floating-point values)
    b) Maybe also copy the vector into a smaller datatype
  4. Sort using the radix sort algorithm (this takes about 80% of runtime in my profiling)
  5. Decompress & deserialize in one pass

*serialize functions also take an order::Base.Order.Ordering as an input argument, and, at least at first, only DirectOrderings would be serializable. Here is a tested draft of serialization.

Questions & Suggestions welcome!

7 Likes

I think step 1 is definitely make a PR to Base. We want fast sorts built in, and this will be a good way to get feedback on it.

4 Likes

No, I wouldn’t recommend this. RadixSort was originally added to Base and then moved to SortingAlgorithms.jl along with the other sorting variants. SortingAlgorithms.jl is probably the right place to submit a pull request for a new sorting variant/optimization.

See also the discussion in sort/sortperm has poor performance · Issue #939 · JuliaLang/julia · GitHub

4 Likes

imo if we are leaving significant performance on the table, it should be in Base. I at least would push pretty hard on triage for this. sorting matters, and we should be the best at it by default.

6 Likes

Radix sorting sacrifices a lot of generality (e.g. it’s mainly for the standard sort order); usually we put the more general solutions in Base and the specialized cases in packages. Packages are also a much faster way to get new code to users, and a more flexible way to experiment when it’s not clear what functionality is the most useful.

5 Likes

What is “triage” in this context?

Roughly biweekly call (accessible to everyone! Just join the triage channel in Slack) where folks consider whether or not to approve changes.
See also the triage label in giuthub for the triage issue list.

7 Likes

While that’s true in general, we do have some very specific operations in Base that are highly optimized for specific types. The example that comes to mind is matrix multiplication, and sorting is probably more widely used (although maybe not yet within the Julia community).

4 Likes

I agree with Oscar here: it makes sense to me to make the standard sorting of common primitive types really fast even if it introduces more sorting algorithms in base and variability in sort speeds. Those are the sorts that are used the most.

7 Likes

My 2 cents.

Float64, Int64 etc are in Base so sorts that optimize for them should be in base. When I first attempted I was new to the OS world and didn’t know how to push it forward. There are several tricky things to consider e.g. how to treat the by and how to make the radixsort as general as possible.

Also, it feels like the current base sort are have an API that is designed (to skews towards) comparison based sort so that makes the radixsort intro to Base a bit more tricky.

Another example is that R has incorporate radixsort into base.

But there are lots of algorithms where matrix multiplication is the performance-critical step, and a lot fewer where sorting is dominant: if you are sorting arrays over and over, probably you are using the wrong data structure. And sorting arrays of fixed-width types over and over is even more specialized.

(Sorting vast on-disk/distributed datasets is a different matter—even doing that once can take a significant amount of time—but that’s not addressed by an in-memory radix sort and is such a specialized application, dependent on the specific data format, that it also probably belongs in packages.)

Sorting is a classic problem in computer science that is interesting for its own sake, but once you’ve gotten it down to O(n log n) with a reasonable constant factor, it seems like usually other parts of your algorithm start to dominate the time. Whereas O(n³) dense-matrix operations are so expensive — strongly super-linear in the O(n²) amount of data — that they tend to dominate the time in programs where they appear for large n.

5 Likes

Also @quinnj has been looking into radix sort lately, at least for strings:

Maybe he can share his experience about whether or not it’s a good general default

Something to keep in mind is that we generically encourage people to employ in-place functions like sort! in performance-critical applications, but radix sort is out-of-place so you would need to know about (and probably benchmark) it to intentionally choose sort instead. It really seems like the kind of algorithm that you need to choose explicitly, for a subset of performance-critical problems where radix sorting is appropriate, and in that case there is much less advantage to having it in Base vs a package.

It’s a tradeoff. Adding code to Base makes the “default” faster without users having to think about it, but if they have to choose it explicitly anyway that’s less of a win. Having it in Base tends to bring the code under more scrutiny and can lead to it being better maintained in the long run; it also relieves users of the mental effort of choosing among a bunch of competing packages. On the other hand, it requires Base to support the same API in perpetuity even if a better approach comes along, slows down the rate of progress to coincide with Julia releases, and discourages users from looking for alternative packages that may be better (for many users, “built-in” == “best”) … the latter is especially pertinent for sorting, where there are many algorithms and which one is best is often problem-dependent.

We had a radix sort in Base at one point, but it was moved into the SortingAlgorithms.jl package. What has changed?

4 Likes

At the very least it would be useful (maybe as a first step) to improve RadixSort in SortingAlgorithms to make it faster (if you say that’s possible) and extensible (currently there’s no mechanism to declare whether a type supports it or not). Unfortunately, SortingAlgorithms doesn’t get a lot of attention so it’s not always easy to get PR reviewed (like mine about radix sort).

data.table in R is famous for its split-apply-combine performance, which is based on radix sort, so apparently sorting performance does matter in some areas. That said, I’ve not been able to get a speedup in DataFrames by using radix sort for grouping, but I’m not sure whether that’s because our radix sort isn’t fast enough yet or because our other grouping methods are already fast. :slight_smile:

7 Likes

Nice.

What do you have in mind (with “the standard sort order”)? You mean ascending; codepoint/ASCIIbetical order? I have my own idea to fix that, by storing an (8-byte) prefix of strings, but with it modified to be suitable for better ordering. Should be compatible with radix-based sorting (but yes, you only get one such standard order, from a global locale setting, other most often not needed, except for “same” descending order that should neither be a problem).

It doesn’t have to be, you can use in-place radix sort or some variant of it, e.g. this Ska sort variant I just discovered:

this is of course a version of radix sort. Meaning its complexity is lower than O(n log n). I made two contributions:

  1. I optimized the inner loop of in-place radix sort. I started off with the Wikipedia implementation of American Flag Sort and made some non-obvious improvements. This makes radix sort much faster than std::sort, even for a relatively small collections. (starting at 128 elements)

  2. I generalized in-place radix sort to work on arbitrary sized ints, floats, tuples, structs, vectors, arrays, strings etc. I can sort anything that is reachable with random access operators like operator[] or std::get. If you have custom structs, you just have to provide a function that can extract the key that you want to sort on. This is a trivial function which is less complicated than the comparison operator that you would have to write for std::sort.

[…]
My other best case is on already sorted sequences: In that case I iterate over the data exactly twice, once to look at each item, and once to swap each item with itself.

The worst case for my implementation can only be reached when sorting variable sized data, like strings.
[…]
In summary, Ska Sort:

  • Is an in-place radix sort algorithm
  • Sorts one byte at a time (into 256 partitions)
  • Falls back to std::sort if a collection contains less than some threshold of items (currently 128)
  • Uses the inner loop of American Flag Sort if a collection contains less than a larger threshold of items (currently 1024)
    […]
  • Automatically converts signed integers, floats and char types to the correct unsigned integer type
  • Automatically deals with pairs, tuples, strings, vectors and arrays by sorting one element at a time
  • Skips over common prefixes of collections. (for example when sorting strings)
  • Provides two customization points to extract the sort key from an object: A function object that can be passed to the algorithm, or a function called to_radix_sort_key() that can be placed in the namespace of your type

So Ska Sort is a complicated algorithm. Certainly more complicated than a simple quick sort. One of the reasons for this is that in Ska Sort, I have a lot more information about the types that I’m sorting. In comparison based sorting algorithms all I have is a comparison function that returns a bool. In Ska Sort I can know that “for this collection, I first have to sort on a boolean, then on a float” and I can write custom code for both of those cases.
[…]
PROBLEMS
Ska_sort isn’t perfect and it has problems. I do believe that it will be faster than std::sort for nearly all data, and it should almost always be preferred over std::sort.

The code here has a good license, and a bit over 1400 lines: GitHub - skarupke/ska_sort

See also performance graphs for above, also here:

Others to consider:

There are two codebases I consider as important to be aware of, both MSD Radix Sorts (I’m in the LSD camp), but both critically relevant. […]
For smaller inputs (and particularly non-uniform inputs) I start coming out ahead (and will be getting even better). When I say I run faster, I mean A LOT faster.

Another candidate: ("Ultra fast MSD radix sorter), at over 14000 lines I’m not sure we want such large one in base (and it’s GPLed): RADULS/sorting_network.cpp at master · refresh-bio/RADULS · GitHub

1 Like

! signals mutation, not in-place-ness. sort is guaranteed not to modify its input. sort! is guaranteed to produce a sorted array in the original input vector. Neither is guaranteed to run in place (and indeed @btime sort!(v) setup = (v=rand(["hello","world"], 1000) uses merge sort which is not in place. Radix would behave similarly to merge sort.

Perhaps that is a good upgrade. The algorithm I’m currently using is only 33 sloc

and gets pretty good performance improvements. If this works out, we could probably replace it with a longer better optimized algorithm, but for now I’m trying to scope this with “doable” and “good” preeminent without “best” or “optimal” as objectives.

I’ll stay out of this discussion for now, but while I’m drafting I’ve made a PR to SortingAlgorithms.jl because their CI tests run faster, and for me developing on Julia Base is a bit of an inconvenience (revise works 95% of the time, but if and when I need to rebuild Julia it takes upwards of an hour on my computer). While this PR is nominally targeted at SortingAlgorithms.jl, I plan to stay within the SymVer guarantees of Julia and the existing API, so it should be easy to put in either place.

3 Likes

While this is a total rewrite, it is heavily inspired by SortingAlgorithms’.

For folks familiar with internals, if views were as fast to work with as manually passing around lo and hi issue, the whole sorting system could feel a bit more idiomatic and nicer to work with.

1 Like

Google Scholar for scholarly bibliography and reading list on fast radix sorting.
The Art
Of Computer Programming Volume 3 Sorting and Searching Second Edition Donald E Knuth for a very detailed discussion on radix sorting.

1 Like

Thank you @iajzenszmi! The particular niche I’m trying to fill with this PR now is in-memory sorting (i.e. sorting datasets that fit into RAM). I found the external radix sorting section 5.4.7 on page 343, but couldn’t find a section on in-memory radix sorting. Could you point it out to me, please?

EDIT: found it. 5.2.5 on page 168 “sorting by distribution”

According to my reading, Knuth presents 1) the classic LSD radix algorithm I’ve implemented in the linked PR, 2) an option to combine distributing & counting the next bucket sizes, and 3) a queue implementation to replace counting altogether.

When deciding what to implement, I chose (1) because I suspected that (2) and (3) yield no more than small constant factor improvements, both make the code longer and less intuitive, and I suspect but have not tested, that (2), and perhaps (3) increase the minimum N for which radix is faster than quicksort or mergesort.

In my reading, Knuth presented a simple algorithm and a more complex algorithm with the same theoretical runtime and asserted that the complex algorithm was faster without empirical evidence. Do you happen to know of any real-world comparisons of these two or three methods? Because if significant performance improvements are available, it may be worth adopting a more complex method.

Alternatively, feel free to implement one of those methods and benchmark it against the one I’m proposing—if it’s faster it would probably make sense to use it instead!

In theory, it should be easy to swap out the underlying radix sort algorithm while keeping all the same plumbing to make it compatible with Julia’s comparison-based paradigm.