TupleSorting is being registered. Use it to sort tuples (even large ones) efficiently and with good type inference:

```
julia> using TupleSorting
julia> const TS = TupleSorting
TupleSorting
julia> t = ntuple((_ -> rand(0:9)), Val(25))
(3, 2, 1, 0, 5, 0, 3, 0, 3, 7, 2, 8, 1, 3, 5, 7, 6, 6, 0, 9, 1, 8, 3, 1, 9)
julia> sorted(t)
(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9)
julia> sorted(t, StaticLengthSortOblivious{20,TS.SmallSize}())
(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9)
julia> sorted(t, Base.Order.Reverse) # reverse order
(9, 9, 8, 8, 7, 7, 6, 6, 5, 5, 3, 3, 3, 3, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0)
julia> sorted(t, StaticLengthSortStable(), Base.Order.Forward) # default arguments
(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9)
```

The implementation is basically a recursive merge sort, with several different merge algorithms available, both stable and unstable.

In the future I’d like to get StaticArrays to depend on TupleSorting for sorting. It could also make sense to move the code into `Base.Sort`

.

## Edit: Further work ~~needed~~ done

After some research and performance measurements, I’ve come to realize something. Even though optimal sorting networks are only known for very small sizes, from a performance standpoint, when the input length is known in advance, it makes sense to use algorithmically generated sorting networks even for relatively large input lengths. I had already known this holds for GPU kernels, but it surprised me that this is also true for execution on a CPU. For example, when sorting a tuple of 50 `Int32`

values on a recent AMD CPU, using a *bitonic sort* specialized for the input length is 10x faster than using quick sort and 5x faster than using `TupleSorting.sorted`

!

Bitonic sort is but one of the algorithms for generating sorting networks. The nice thing about it is that it’s already implemented and provided for sorting `StaticVector`

s by `StaticArrays`

. However quite some other algorithms are available, including very recent work. EDIT: finally I decided to implement two algorithms from Knuth’s TAOCP Volume 3 Chapter 5. One produces networks of small size, and the other produces networks of small depth, although they should produce identical network for power-of-two lengths. The latter algorithm should be strictly better than bitonic sort. Optionally they’re augmented with the optimal networks from OSN.

NB: even though such huge sorting networks are great for sorting many tuples/static vectors of the same small length, don’t try and use it for sorting many different input lengths in a hot loop. That would probably slow down your application because of thrashing the CPU instruction cache.