Faster default Base.sort! for Julia

There are functions like factorial where Julia uses a factorial_lookup function to directly return results from a table when the value is lower or equal to 20.

Playing around with sorting algorithms I built a simple one that takes into account the size of the vector to sort just like factorial takes into a account the value passed to calculate its factorial.

As it turns out the algorithm taking into account size is consistently faster than the default for Julia up to a size of five elements, hence my question, why does not Julia uses the same strategy for sort! as for factorial considering how critical sorting algorithms are in computing?

Below follows the code and some benchmarks:

function sort2!(x)
    swap(i,j) = x[j], x[i] = x[i], x[j]
    x[1]>x[2] && swap(1,2)
end

function sort3!(x)
    swap(i,j) = x[j], x[i] = x[i], x[j]
    if x[1]>x[2] sort2!(x) end
    if x[2]>x[3] swap(2,3); sort2!(x) end
end

function sort4!(x)
    swap(i,j) = x[j], x[i] = x[i], x[j]
    if x[1]>x[2] sort2!(x) end
    if x[2]>x[3] swap(2,3); sort2!(x) end
    if x[3]>x[4] swap(3,4); sort3!(x) end
end

function sort5!(x)
    swap(i,j) = x[j], x[i] = x[i], x[j]
    if x[1]>x[2] sort2!(x) end
    if x[2]>x[3] swap(2,3); sort2!(x) end
    if x[3]>x[4] swap(3,4); sort3!(x) end
    if x[4]>x[5] swap(4,5); sort4!(x) end
end

I am just using this algorithm to test the “factorial” idea, I haven’t done any theoretical work on it to guarantee it is sound, to make sure it works I compared its results multiple times with Base.sort!

Comparing benchmarks for vector of length 3 we have:

sort!:        117.439 ns ± 42.243 ns
sort3!:      103.014 ns ± 76.325 ns

Comparing benchmarks for vector of length 5 we have:

sort!:      135.428 ns ± 43.606 ns
sort5!:    128.527 ns ± 60.103 ns

Comparing benchmarks for vector of length 6 we have:

sort!:      148.199 ns ± 45.349 ns
sort6!:    151.984 ns ± 59.283 ns

If I’m not mistaken, you’ve basically rediscovered sorting networks.

I guess the simple reason is that when sorting vectors of such short length, 17 nano seconds difference is basically a rounding error, especially with other noise going on in a real-life workload. It’s more important to do these kinds of optimizations with e.g. factorial or fibonacci implementations, because it vastly reduces the time common calls take. In contrast, most calls to sort! have much longer arguments so this would be optimizing the very rare case.

2 Likes

Oh, wow, I didn’t know about this, very interesting. Thank you for sharing!

The difference is consistent, they’re faster, also I just read in the article you provided that sorting networks can be parallelized! In the article we can also read:

An optimal network for size 11 was found in December 2019 by Jannis Harder

Why don’t we parallelize and use that one in Julia?

That depends on what field of science you’re working on, what is rare for some might be common for some others… also, we can always challenge other languages to see how fast they sort short vectors :smile:

That can be a nice micro optimization.

It looks like you’ve been bitten by the sorting bug. For more on sorting algorithms, see this thread. I posted 4 links you might find of interest.

At the bottom of the referenced link, I post a couple of links to algorithms that use SIMD and multiple threads. That would be the ultimate sort right now for primitive types. The way to get such a sort incorporated into base would be to first release a package with it. After a while, once it’s been debugged, and lots of people are finding it useful, there is a chance it could be pulled into base.

Or … maybe it just stays in a package. Packages work great. Although it can be tedious to load two dozen packages and re-add them for Julia releases. I can’t say I know of a better alternative. Package compiler can help somewhat.

It also depends on the philosophy of Julia. I created a post asking about that, but it didn’t get much traction. Do we want everything to be eventually SIMD and multithreaded? Or do people get annoyed when functions launch threads and prefer to have single threaded, unless they ask otherwise?

Micro optimization would be low-level optimizations that do not change the overall structure of the program.

Sorting Networks do change the structure of the program, sure we can then micro optimize them as well, but we’re not talking anything low level now, we are talking about adding a faster algorithm for short arrays than the Julia sort! function is using and dispatch those algorithms based on the size of the vector.

2 Likes