Sorting Policy

What we have now:

Right now Base/Sort.jl has a sorting policy implementation that is somewhat counterintuitive to me.

I see three levels of function call,

  1. Keyword: sort([2,3,7], by=x->floor(sqrt(x)), rev=true) with 10 methods:
    sort, sort!, partialsort, partialsort!, sortperm, sortperm!, partialsortperm, partialsortperm!, sort on a Matrix/Array, and sort! on a Matrix/Array.
  2. 3-argument: sort!([2,3,7], QuickSort, By(x->floor(sqrt(x)), Reverse)) with 5 methods:
    sort!, fpsort!, sort_int_range!, partial_sort!, and sortperm_int_range
  3. 5-argument: sort!([2,3,7], 1, 3, QuickSort, By(x->floor(sqrt(x)), Reverse)) with 1 function dispatched on 4 algorithm types:
    InsertionSort, QuickSort, PartialQuickSort, and MergeSort.

Which of these levels is part of the interface? I’d like it to be just the first, and only the first is documented, but sort! is exported, so both the 3- and 5-argument versions are therefore also exported.

sort, partialsort, & partialsortperm internally call copymutable and then dispatch to their corresponding mutating versions at the keyword level.

sortperm!, partialsortperm!, & sort! on Matrix/Array internally call the 3-argument method of sort!.

Similarly, partialsort! calls the 3-argument method of partialsort!.

This leaves sort!, sortperm, and sort on Matrix/Array with non-passthrough behavior: sort! dynamically chooses between the 3-argument sort! and sort_int_range!. Similarly, sortperm dynamically chooses between the 3-argument sort! and sortperm_int_range. Finally, sort on Matrix/Array jumps via its helper sort_chunks! to the 5-argument sort!.

At the 3-argument level, sort! specializes to sometimes statically dispatch to fpsort!. All of sort!, fpsort!, and partialsort! then call the 5-argument method of sort!.

This structure conflates the presence of lo and hi with policy decisions of whether to perform a floating point sort…

What I propose:

I propose to eliminate the 5-argument methods. If we want to talk about the 5th through 63rd element of a 700 element long v::AbstractVector, we should use view(v, 5:63)::AbstractVector, not v::AbstractVector, 5::Integer, 63::Integer.
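This already works today with the public API, since sort! accepts any AbstractVector, including a SubArray:

```julia
v = collect(100:-1:1)    # a reverse-sorted 100-element Vector{Int}
sort!(view(v, 5:63))     # sort only elements 5 through 63, in place
# elements outside the view are left untouched
```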

The 5-argument/3-argument distinction serves a second (counterintuitive) purpose, of facilitating dispatch to fpsort!. I propose to replace this with a new Algorithm:

struct DetectSpecialCases{T<:Algorithm} <: Algorithm
    default::T
end

and set const DEFAULT_STABLE = DetectSpecialCases(MergeSort) and const DEFAULT_UNSTABLE = DetectSpecialCases(InsertionSort).

We then move (almost*) all policy decisions into the method sort!(v::AbstractVector, a::DetectSpecialCases, o::Ordering) by defining

sort!(v::AbstractVector{<:Union{Float32, Float64}}, a::DetectSpecialCases, o::DirectOrdering) = fpsort!(v, a, o)
function sort!(v::AbstractVector, a::DetectSpecialCases, o::Ordering)
    if should_int_range(v, o)
        sort_int_range!(v, o)
    else
        sort!(v, a.default, o)
    end
end

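To make the pattern concrete, here is a hypothetical, self-contained sketch of the same dispatch scheme. The names my_sort!, should_int_range, and InsertionSortAlg are illustrative stand-ins for the Base internals (the Ordering argument is omitted for brevity); this is not Base’s actual code:

```julia
# Hypothetical stand-ins for the Base internals; names are illustrative only.
abstract type Algorithm end
struct InsertionSortAlg <: Algorithm end
struct DetectSpecialCases{T<:Algorithm} <: Algorithm
    default::T
end

# Special case: a small integer value range relative to length makes counting sort pay off.
should_int_range(v) = eltype(v) <: Integer && !isempty(v) &&
    (maximum(v) - minimum(v)) < length(v)

function my_sort!(v::AbstractVector, a::DetectSpecialCases)
    if should_int_range(v)
        # counting sort over the value range
        lo = minimum(v)
        counts = zeros(Int, maximum(v) - lo + 1)
        for x in v
            counts[x - lo + 1] += 1
        end
        i = 1
        for (k, c) in enumerate(counts), _ in 1:c
            v[i] = k + lo - 1
            i += 1
        end
    else
        my_sort!(v, a.default)   # fall back to the wrapped default algorithm
    end
    return v
end

my_sort!(v::AbstractVector, ::InsertionSortAlg) = sort!(v)  # stand-in fallback

const MY_DEFAULT_UNSTABLE = DetectSpecialCases(InsertionSortAlg())
```

The point of the wrapper is that all special-case detection lives in one method, and the wrapped default is an ordinary field, so callers (and future hooks like RadixSort) compose by construction rather than by convention.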
The keyword partialsort! method would directly call the 3-argument sort! method.

For backward compatibility we could define a deprecated sort!(v::AbstractVector, lo::Integer, hi::Integer, a::Algorithm, o::Ordering) = sort!(view(v, lo:hi), a, o).

I think this explicit handling of special-case detection is clearer and more easily usable by folks who have domain knowledge but don’t want to use undocumented or unexported functionality, by Base developers, and even by folks who do want to use internals. It would be easier, for example, to correctly insert a hook for dispatching to RadixSort. Hopefully searching for DetectSpecialCases will be an effective way to understand Base’s internal sorting policy decisions.

Some objective impacts:
sort(rand(1:10, 100), alg=MergeSort) would use merge sort rather than counting sort.
partialsort(rand(100), 1:4) would utilize the fpsort! optimizations.
sort(rand(10,10), dims=1) would utilize the fpsort! optimizations.

*the one remaining policy decision is the dispatch of sortperm (not sortperm!) on Vectors (not AbstractVectors), of eltype <: Integer, sorted in Forward order, with a range strictly smaller than the type’s range (not sortperm(rand(UInt8, 10_000))), and smaller than half the length of the collection (neither sortperm(rand(Int16, 10_000)) nor sortperm(rand(1:100, 150))).
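As a hedged reading of that condition, here is a hypothetical predicate illustrating the description above (not Base’s actual check; the Forward-order requirement is omitted):

```julia
# Hypothetical predicate; illustrates the described conditions, not Base's code.
function uses_int_range_perm(v::AbstractVector)
    (v isa Vector && eltype(v) <: Integer && !isempty(v)) || return false
    lo, hi = extrema(v)
    rangelen = Int128(hi) - Int128(lo) + 1    # widen to avoid overflow
    typelen = Int128(typemax(eltype(v))) - Int128(typemin(eltype(v))) + 1
    rangelen < typelen && rangelen < length(v) ÷ 2
end
```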


Vectorized and performance-portable Quicksort

Thanks! I’ll look into it!

For the record, Julia’s default algorithms are already as fast as or faster than the performance figures they claim.

They say “For one million 32/64/128-bit numbers, our code running on Apple M1 can produce sorted output at rates of 499/471/466 MB/s”

On Julia 1.9.0-DEV.439 on my 2019 Intel i5 Mac, we have

julia> sizeof(rand(Int, 10^6)) / (@belapsed sort!(x) (setup=(x=rand(Int, 10^6))))

for 306 MB/s
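The same throughput figure can be approximated without BenchmarkTools, though a single @elapsed run is noisier than @belapsed, which takes the minimum over many samples:

```julia
x = rand(Int, 10^6)
sort!(x)                   # warm-up run so sort! is compiled before timing
x = rand(Int, 10^6)
t = @elapsed sort!(x)      # one timed run on fresh data
throughput_MBps = sizeof(x) / t / 1e6
```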

306 MB/s is a bit lower than 471 MB/s, but this is more than explained by the CPU difference: “In some tests, the M1 shows a slight improvement, but in most tests, the M1 blows the Intel version away.” (source)


On the M1 mac I get

julia> sizeof(rand(Int, 10^6)) / (@belapsed sort!(x) (setup=(x=rand(Int, 10^6))))

so considerably faster


Huzzah! Thanks for the benchmark.

Their technique seems related to what’s going on in Stabilize, optimize, and increase robustness of QuickSort by LilithHafner · Pull Request #45222 · JuliaLang/julia · GitHub. One difference is that I’m not particularly interested in dealing with @simd and friends right now, so I’m writing ordinary for loops and waiting for the wonderful folks in compiler optimizations to do it for me.

I initially thought we’re faster than Google because we don’t use quick sort, but then

julia> Base.Sort.defalg(rand(Int, 10))

so I’m confused, what are we doing that’s so good?

You’re probably running a Julia version before v1.9.0-DEV. We’re faster on 1.9 because we use radix sort.

That’s what I thought, which makes the “Julia is faster” comparison pointless because it’s not quick sort…

Yes and no. You could instead argue that manual SIMD optimization of quicksort is the side that’s pointless since for the places you would use it, you could just use radix sort instead.
