Sorting Policy

What we have now:

Right now Base/Sort.jl has a sorting policy implementation that is somewhat counterintuitive to me.

I see three levels of function call,

  1. Keyword: sort([2,3,7], by=x->floor(sqrt(x)), rev=true) with 10 methods:
    sort, sort!, partialsort, partialsort!, sortperm, sortperm!, partialsortperm, partialsortperm!, sort on a Matrix/Array, and sort! on a Matrix/Array.
  2. 3-argument: sort!([2,3,7], QuickSort, By(x->floor(sqrt(x)), Reverse)) with 5 methods:
    sort!, fpsort!, sort_int_range!, partial_sort!, and sortperm_int_range
  3. 5-argument: sort!([2,3,7], 1, 3, QuickSort, By(x->floor(sqrt(x)), Reverse)) with 1 function dispatched on 4 algorithm types:
    InsertionSort, QuickSort, PartialQuickSort, and `Merge

Which of these levels is part of the interface? I’d like it to be just the first, and only the first is documented, but sort! is exported, and both the 3 and 5 argument versions are therefore also exported. Sort`.

sort, partialsort, & partialsortperm internally copymutable and call their corresponding mutating version at the keyword level.

sortperm!, partialsortperm!, & sort! on Matrix/Array internally call the 3 argument method of sort!.

Similarly, partialsort! calls the 3 argument method of partialsort!

This leaves sort!, sortperm, and sort on Matrix/Array with non-passthrough behavior: sort! dynamically chooses between the three argument sort! and sort_int_range!. Similarly sortperm dynamically chooses between the three argument sort! and sortperm_int_range. Finally, sort on Matrix/Array jumps via its sort_chunks! to the 5 argument sort!.

At the three argument level, sort! specializes to sometimes statically dispatch to fpsort!. All of sort!, fpsort!, and partialsort! then call the 5 argument method of sort!.

This structure conflates the presence of lo and hi with policy decisions of whether to perform a floating point sort…

What I propose:

I propose to eliminate the 5-argument methods. If we want to talk about the 5th through 63rd element of a 700 element long v::AbstractVector, we should use view(v, 5:63)::AbstractVector, not v::AbstractVector, 5::Integer, 63::Integer.

The 5-argument/3-argument distinction serves a second (counterintuitive) purpose, of facilitating dispatch to fpsort!. I propose to replace this with a new Algorithm:

struct DetectSpecialCases <: Algorithm

and set const DEFAULT_STABLE = DetectSpecialCases(MergeSort) and const DEFAULT_UNSTABLE = DetectSpecialCases(InsertionSort).

We then move (almost*) all policy decisions into the method sort!(v::AbstractVector, a::DetectSpecialCases, o::Ordering) by defining

sort!(v::AbstractVector{Float}, a::DetectSpecialCases, o::DirectOrdering) = fpsort(v,a,o)
function sort!(v::AbstractVector, a::DetectSpecialCases, o::Ordering) 
    if should_int_range(v, o)
        sort_int_range!(v, o)
        sort!(v, a.default, o)

The keyword partialsort! method would directly call the 3 argument sort! method.

For backward compatibility we could define deprecated sort!(v::AbstractVector, lo::Integer, hi::Integer, a::Algorithm, o::Ordering) = sort!(view(v, lo:hi), a, o)).

I think this explicit handling of special case detection is more clear and more easily usable by folks who have domain knowledge but don’t want to use un-documented or un-exported functionality, Base developers, and even folks are who want to use internals. If would be easier, for example, to correctly insert a hook for dispatching to RadixSort. Hopefully searching for DetectSpecialCases will be an effective way to understand base’s internal sorting policy decisions.

Some objective impacts:
sort(rand(1:10, 100), alg=MergeSort) would use merge sort rather than counting sort
partialsort(rand(100), 1:4) would utilize the fpsort! optimizations.
sort(rand(10,10), dims=1) would utilize the fpsort! optimizations.

*the one remaining policy decision is the dispatch of sortperm (not sortperm!) on Vectors (not AbstractVectors), of eltype <: Integer, sorted in Forward order, with a range strictly smaller than the type’s range (not sortperm(rand(UInt8, 10_000))), and smaller than half the length of the collection (neither sortperm(rand(Int16, 10_000)) nor sortperm(rand(1:100, 150))).

1 Like