[ANN] TupleSorting: Sort tuples efficiently and with good type inference

TupleSorting is being registered. Use it to sort tuples (even large ones) efficiently and with good type inference:

julia> using TupleSorting

julia> const TS = TupleSorting
TupleSorting

julia> t = ntuple((_ -> rand(0:9)), Val(25))
(3, 2, 1, 0, 5, 0, 3, 0, 3, 7, 2, 8, 1, 3, 5, 7, 6, 6, 0, 9, 1, 8, 3, 1, 9)

julia> sorted(t)
(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9)

julia> sorted(t, StaticLengthSortOblivious{20,TS.SmallSize}())
(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9)

julia> sorted(t, Base.Order.Reverse)  # reverse order
(9, 9, 8, 8, 7, 7, 6, 6, 5, 5, 3, 3, 3, 3, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0)

julia> sorted(t, StaticLengthSortStable(), Base.Order.Forward)  # default arguments
(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9)

The implementation is basically a recursive merge sort, with several different merge algorithms available, both stable and unstable.

In the future I’d like to get StaticArrays to depend on TupleSorting for sorting. It could also make sense to move the code into Base.Sort.

Edit: Further work needed done

After some research and performance measurements, I’ve come to realize something. Even though optimal sorting networks are only known for very small sizes, from a performance standpoint, when the input length is known in advance, it makes sense to use algorithmically generated sorting networks even for relatively large input lengths. I had already known this holds for GPU kernels, but it surprised me that this is also true for execution on a CPU. For example, when sorting a tuple of 50 Int32 values on a recent AMD CPU, using a bitonic sort specialized for the input length is 10x faster than using quick sort and 5x faster than using TupleSorting.sorted!

Bitonic sort is but one of the algorithms for generating sorting networks. The nice thing about it is that it’s already implemented and provided for sorting StaticVectors by StaticArrays. However quite some other algorithms are available, including very recent work. EDIT: finally I decided to implement two algorithms from Knuth’s TAOCP Volume 3 Chapter 5. One produces networks of small size, and the other produces networks of small depth, although they should produce identical network for power-of-two lengths. The latter algorithm should be strictly better than bitonic sort. Optionally they’re augmented with the optimal networks from OSN.

NB: even though such huge sorting networks are great for sorting many tuples/static vectors of the same small length, don’t try and use it for sorting many different input lengths in a hot loop. That would probably slow down your application because of thrashing the CPU instruction cache.

12 Likes

That’s very nice! Thanks for this.

Since you didn’t mention it, I’ll post benchmark comparisons relative to Base.sort

julia> @btime sort($t)    # Using Base
  159.548 ns (1 allocation: 256 bytes)
(0, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 7, 7, 7, 8, 9)

julia> @btime sorted($t)    # Using TupleSorted
  158.396 ns (0 allocations: 0 bytes)
(0, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 7, 7, 7, 8, 9)

julia> @btime sorted($t, StaticLengthSortOblivious{20,TupleSorting.SmallSize}())
  61.165 ns (0 allocations: 0 bytes)
(0, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 7, 7, 7, 8, 9)
1 Like

Perhaps StaticLengthSortOblivious{length(t), TS.SmallSize} should be the default method? It seems to be uniformly faster on my machine

I followed Julia Base, whose sort algorithm also defaults to a stable sort.

BTW, the first parameter is independent from the tuple length. It specifies the degree to which the base cases of the recursion should be accelerated using the optimal sorting networks from OptimalSortingNetworks.

1 Like