To make Ordering objects more composable and reusable
To figure out a better way to dispatch on efficient sorting algorithms.
The following is one of the motivating examples where sort! does not dispatch on an efficient floating point sorting algorithm, because the signature is sort!(::Vector{Product}, ...) rather than sort!(::Vector{Float64}). In my package I try to dispatch on the inferred type that is being compared / sorted, which allows me to use the efficient floating point sorting algorithm .
Results
n
sort!
sortsort!
x faster
10_000
1.113 ms
560.6 μs
2.0x
1_000
66.00 μs
11.66 μs
5.7x
100
2.021 μs
615.6 ns
3.3x
Benchmark
using SortingSortingOut, BenchmarkTools
struct Product
price::Int
weight::Float64
end
weight(p::Product) = p.weight
function sort_products(n = 100)
products = [Product(rand(1:100), 100rand()) for i = 1 : n]
fst = @benchmark sort!(ps, by = $weight) setup = (ps = copy($products))
snd = @benchmark sortsort!(ps, $(By(weight))) setup = (ps = copy($products))
fst, snd
end
I’m hoping people could give feedback on this idea
Very nice I also agree that we could clean up this area a bit.
One thing I would to see is some way of indicating that a data source is already sorted appropriately, so that findall, findfirst and so-on will just use searchsorted, searchsortedfirst, etc. I think I’ve seen this rough idea mentioned before, somewhere. (This also came up in one of my side projects more recently, the SortIndex in AcceleratedArrays.jl does exactly this).
One of the things that makes sortperm slower is that it guarantees stability (indices of repeated values appear in ascending order). If that requirement is dropped, the implementation would just be
I suspect that stability could be patched up after the fact for sortperm: do an unstable sort and then just scan through the indices and sort each equivalence class of indices (equivalent in the sense that v[i] == v[j]). Of course, in the worst case, that’s sorting the entire range of indices, but maybe in practice it would be ok. It seems like it should be possible to take advantage of the fact that we know the exact range of index values… seems like the perfect use case for a radix sort.
One could do it the same way as for normal sorts, i.e. it is only stable when the user requests it via the algorithm keyword.
For small bitstypes (or small results of By) and medium-sized arrays, it might make sense to sort! a temporary of (idx::Int, compareT::T) tuples, for better cache locality. I think this is what kills current sortperm performance.
SortingNetworks master (pending merge into METADATA) already does dispatch on either N args or NTuples where N is in 1:16. And as to sortyes, only for small N (1…16, which are provably optimal as exchange networks, already ready).
So after giving it a try, I am very underwhelmed by the cache-local variant. Or, more precisely, I am absolutely shocked at how fast sortperm is, even though it should basically have no spatial locality at all.
julia> function _sortperm(A)
tmp = [(@inbounds A[i],i) for i=1:length(A)]
sort!(tmp)
[t[2] for t in tmp]
end
julia> v=rand(Int,10^8); sort(v[1:1000]); sortperm(v[1:1000]); _sortperm(v[1:1000]);
julia> begin
@time sort(v)
@time sortperm(v)
@time _sortperm(v);
end;
13.288238 seconds (6 allocations: 762.940 MiB, 0.70% gc time)
42.778155 seconds (7 allocations: 762.940 MiB, 0.15% gc time)
23.058652 seconds (13 allocations: 2.980 GiB, 0.47% gc time)
Thanks! Then some zip-sort-extract variant is probably worthwhile for small bitstypes (as long as the temporary fits into memory), seeing a 3x speedup:
julia> function _sortperm(A)
tmp = [(@inbounds A[i],i) for i=1:length(A)]
sort!(tmp; alg=QuickSort)
[t[2] for t in tmp]
end
_sortperm (generic function with 1 method)
julia> v=rand(Int,10^8); sort(v[1:1000]); sortperm(v[1:1000]); _sortperm(v[1:1000]);
julia> begin
@time sort(v)
@time sortperm(v)
@time _sortperm(v);
end;
13.137279 seconds (6 allocations: 762.940 MiB, 0.32% gc time)
42.627918 seconds (7 allocations: 762.940 MiB, 0.00% gc time)
16.295594 seconds (11 allocations: 2.235 GiB, 1.01% gc time)
PS. ~100KB - 1MB of buffer to sort appears to be an OK cross-over for the copy to amortize.