Package for partial sorting output iterator

Is there an implementation that partial sorts an iterator without allocating the whole output?

Eg suppose that I need the n largest elements by f from an iterator itr that has N \gg n elements.

partialsort!(collect(itr), n; by f) would allocate a length N vector, while the whole thing can be handled with a length n vector that is kept sorted.

2 Likes

One quick implementation would be like:

using DataStructures
function partialsort(itr, n)
    ret = SortedSet{eltype(itr)}()
    for x in itr
        push!(ret, x)
        if length(ret) > n
            pop!(ret)
        end
    end
    ret
end
6 Likes

Perhaps SortedSet can cause problems because it folds identical items (as a mathematical set does). An alternative data structure to use is a heap:

function partialsort2(itr, n)
    ret = BinaryMinMaxHeap{eltype(itr)}()
    for x in itr
        push!(ret, x)
        if length(ret) > n
            popmin!(ret)
        end
    end
    ret
end
6 Likes

Note that these two solutions are O(N\log n) while partialsort! has the potential to be O(N) (I didn’t look at the actual implementation though).

You can’t do better than O(N log(N)) for full comparison sort. This means in general you also can’t to partial sort in O(N). Otherwise you would just choose n=N and end up with a faster full comparison sort.
Now if your problem has extra structure, there can be faster algorithms. Like O(N) radix sort for a vector of integers. Or O(1) if your collection is say a range.

1 Like

I think you are misinterpreting partialsort!. You don’t need the output array to be sorted, you only need to find the position of the n-th smallest element (and as a consequence you can find the position of the n smallest elements with an extra O(n) time). Setting n=N would just give you the largest element, which obviously can be found in time O(N).

2 Likes

Oh thanks for clarifying.

2 Likes

I think this operation may done in O(N) time as follows. For simplicity of explanation, assume N is a multiple of n. Form a buffer with 2n slots. Initially load the buffer with 2n items from the iterator. Then find the median, which requires O(n) time, and discard the elements below the median, leaving n items in the buffer. Now load another n items from the iterator, again filling the buffer. Repeat the operation of taking the median and discarding the n smaller elements. The number of median operations is N/n, and each one requires O(n) operations, so the total running time is O(N).

The OP asked for a package rather than an algorithm; I don’t know of a package to implement this operation.

1 Like