Taking sorting & ordering seriously

Hi all,

I’m happy to share a proof of concept for refactoring the sorting and ordering API in Julia:

SortingSortingOut.jl

The basic ideas are:

  1. To make Ordering objects more composable and reusable
  2. To figure out a better way to dispatch on efficient sorting algorithms.

The following is one of the motivating examples: sort! does not dispatch on an efficient floating-point sorting algorithm, because the signature is sort!(::Vector{Product}, ...) rather than sort!(::Vector{Float64}). In my package I instead dispatch on the inferred type that is being compared / sorted, which allows the efficient floating-point sorting algorithm to be used :slight_smile:.

Results

n        sort!      sortsort!   x faster
10_000   1.113 ms   560.6 μs    2.0x
1_000    66.00 μs   11.66 μs    5.7x
100      2.021 μs   615.6 ns    3.3x

Benchmark

using SortingSortingOut, BenchmarkTools

struct Product
    price::Int
    weight::Float64
end

weight(p::Product) = p.weight

function sort_products(n = 100)
    products = [Product(rand(1:100), 100rand()) for _ in 1:n]

    fst = @benchmark sort!(ps, by = $weight) setup = (ps = copy($products))
    snd = @benchmark sortsort!(ps, $(By(weight))) setup = (ps = copy($products))

    fst, snd
end
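
For reference, once sort_products has run you can compare the two trials directly with BenchmarkTools, e.g.

fst, snd = sort_products(10_000)
judge(minimum(snd), minimum(fst))   # improvement/regression of sortsort! relative to sort!
ratio(minimum(snd), minimum(fst))   # raw time ratio, roughly 0.5 for the 10_000 row above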

I’m hoping people could give feedback on this idea :smiley:

22 Likes

Strongly approve. This stuff could really use some love. Improving the performance of sortperm would also be a great thing.

5 Likes

Also happy to contribute my string sort algorithms WIP: faster string sort - #74 by xiaodai

1 Like

Well, since you asked for feedback… I was really hoping for something more than 80% faster. :wink:

6 Likes

So, is it 80% faster, or 5x as fast?

2 Likes

I wrote this a while back: https://github.com/JeffreySarnoff/SortingNetworks.jl
(use master for faster tuple handling)

Very nice :slight_smile: I also agree that we could clean up this area a bit.

One thing I would like to see is some way of indicating that a data source is already sorted appropriately, so that findall, findfirst and so on will just use searchsorted, searchsortedfirst, etc. I think I’ve seen this rough idea mentioned before, somewhere. (This also came up in one of my side projects more recently; the SortIndex in AcceleratedArrays.jl does exactly this.)
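
For concreteness, this is the kind of shortcut I mean, using only Base functions: if we already know v is sorted, a findfirst-style query can become a binary search.

v = sort(rand(1:100, 10_000))

i_lin = findfirst(==(42), v)            # linear scan, O(n)

r = searchsorted(v, 42)                 # binary search, O(log n); only valid because v is sorted
i_bin = isempty(r) ? nothing : first(r)

i_lin == i_bin                          # true (both are nothing when 42 is absent)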

2 Likes

That was probably it :slight_smile:

Thanks all!

One of the things that makes sortperm slower is that it guarantees stability (indices of repeated values appear in ascending order). If that requirement is dropped, the implementation would just be

sortperm(xs, ord) = sortsort!(Vector(Base.OneTo(length(xs))), By(i -> @inbounds(xs[i]), ord))

and it could automatically dispatch to specialized sorting algorithms in my proof of concept code :slight_smile:.

That looks really nice! After reading the discussion on GitHub, I’m sure you have some ideas about what sort! should really dispatch on.

This took me a bit too long :stuck_out_tongue:. Fortunately there are the absolute numbers as well ^^.

So should this eventually dispatch on sort(::NTuple)?

I suspect that stability could be patched up after the fact for sortperm: do an unstable sort and then just scan through the indices and sort each equivalence class of indices (equivalent in the sense that v[i] == v[j]). Of course, in the worst case, that’s sorting the entire range of indices, but maybe in practice it would be ok. It seems like it should be possible to take advantage of the fact that we know the exact range of index values… seems like the perfect use case for a radix sort.
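
A rough sketch of that patch-up pass in plain Julia (using an ordinary sort! on each index range rather than the radix sort idea):

function stabilize!(p::Vector{Int}, v::AbstractVector)
    i, n = 1, length(p)
    while i <= n
        # find the maximal run p[i:j] whose values all compare equal
        j = i
        while j < n && v[p[j + 1]] == v[p[i]]
            j += 1
        end
        # stability: indices of equal values must appear in ascending order
        j > i && sort!(view(p, i:j))
        i = j + 1
    end
    return p
end

# usage: p = some_unstable_sortperm(v); stabilize!(p, v)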

1 Like

Re sortperm performance / stability:

One could do it the same way as for normal sorts, i.e. it is only stable when the user requests it via the algorithm keyword.

For small bitstypes (or small results of By) and medium-sized arrays, it might make sense to sort! a temporary of (idx::Int, compareT::T) tuples, for better cache locality. I think this is what kills current sortperm performance.

1 Like

SortingNetworks master (pending merge into METADATA) already dispatches on either N args or NTuples where N is in 1:16. And as to sort: yes, but only for small N (1…16), which are provably optimal as exchange networks and already implemented.
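
Something like this (sketch; see the README for the exact API):

using SortingNetworks

swapsort(4, 1, 3, 2)     # (1, 2, 3, 4), the N-argument method
swapsort((4, 1, 3, 2))   # same result via the NTuple method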

So after giving it a try, I am very underwhelmed by the cache-local variant. Or, more precisely, I am absolutely shocked at how fast sortperm is, even though it should basically have no spatial locality at all.

julia> function _sortperm(A)
           tmp = [(@inbounds(A[i]), i) for i = 1:length(A)]
           sort!(tmp)
           [t[2] for t in tmp]
       end
julia> v=rand(Int,10^8); sort(v[1:1000]); sortperm(v[1:1000]); _sortperm(v[1:1000]); 
julia> begin
       @time sort(v)
       @time sortperm(v)
       @time _sortperm(v);
       end;
 13.288238 seconds (6 allocations: 762.940 MiB, 0.70% gc time)
 42.778155 seconds (7 allocations: 762.940 MiB, 0.15% gc time)
 23.058652 seconds (13 allocations: 2.980 GiB, 0.47% gc time)

Sort is stable by default, which is the opposite.

Thanks! Then some zip-sort-extract variant is probably worthwhile for small bitstypes (as long as the temporary fits into memory), seeing a 3x speedup:

julia> function _sortperm(A)
           tmp = [(@inbounds(A[i]), i) for i = 1:length(A)]
           sort!(tmp; alg = QuickSort)
           [t[2] for t in tmp]
       end
_sortperm (generic function with 1 method)

julia> v=rand(Int,10^8); sort(v[1:1000]); sortperm(v[1:1000]); _sortperm(v[1:1000]);

julia> begin
       @time sort(v)
       @time sortperm(v)
       @time _sortperm(v);
       end;
 13.137279 seconds (6 allocations: 762.940 MiB, 0.32% gc time)
 42.627918 seconds (7 allocations: 762.940 MiB, 0.00% gc time)
 16.295594 seconds (11 allocations: 2.235 GiB, 1.01% gc time)

PS. A sort buffer of roughly 100 KB to 1 MB appears to be an OK cross-over point for the copy to amortize.