Keep n largest values compared by f

I am iterating through about 10^6 values x and want to keep n of those for which f(x) is the largest.n is around 1000.

I have looked at similar topics (1, 2, 3), some suggest maintaining a vector (see below), some min-max heaps from DataStructures.jl, but I found that the implementation does not support a comparison function.

For reference, here is a trivial implementation of using vectors:

struct KeepLargest{T,F}
    max_count::Int
    largest::Vector{T}
    by::F
    function KeepLargest{T}(max_count, by::F = identity) where {T,F}
        new{T,F}(max_count, Vector{T}(), by)
    end
end

function record!(kl::KeepLargest{T}, x::T) where T
    # note: we simply maintain the invariant issorted(kl.largest; kl.by)
    (; max_count, largest, by) = kl
    i = searchsortedlast(largest, by(x); by = first) # new position
    if length(largest) < max_count
        insert!(largest, i + 1, x)
    elseif i > 0
        for j in 1:(i-1)
            largest[j] = largest[j + 1]
        end
        largest[i] = x
    end
end

An MWE with test:

kl = KeepLargest{Pair{Float64,Float64}}(10, first)
y = Vector{Pair{Float64,Float64}}()
for _ in 1:1000
    r = randn()
    record!(kl, r^2 => r)
    push!(y, r^2 => r)
    @assert issorted(kl.largest; by = kl.by)
end
sort!(y; by = kl.by)
@assert y[(end-10+1):end] == kl.largest

I have two questions:

  1. is there anything more efficient in a package somewhere?

  2. if not, is there a package which has more or less the above implementation? (It feels kind of silly to include it as a utility in a totally unrelated package, with tests and all.)

  • the best alg probably depends on the structure of your data (does it look random?)
  • in the general case it’s probably hard to beat a bounded size min heap
  • if the data is already an AbstractArray then partialsort! will be close to optimal most of the time and very easy
  • the insert-sorted approach can be improved a lot by doing less shifting around in insert! and adding a fast path to reject elements that are already smaller than smallest element

Can’t you simply use a BinaryHeap from datastructures.jl? That supports a custom Ordering DataStructures.jl/src/heaps/binary_heap.jl at 4513f67936c4c758688c3893493339ec6a202183 · JuliaCollections/DataStructures.jl · GitHub

You don’t need a min-max heap, you just need the minimal value (if the new value is larger than the current min, and you have 1000 values, then pop the old min and push the value).

Doing that already (i > 0).

I looked at the code and it is implemented using a vector, like the above. So I guess I could use that, as it is equivalent. Correction: No, it is of course not, just using one for storage.

Good point!

As I commented in the other thread, you can (in principle) do better than this by combining the pop and push operations: Maintaining a fixed size "top N" values list - #7 by stevengj (and that implementation also accepts an arbitrary Ordering).

Once I benchmarked various implementations I realized the obvious: if I am keeping the 0.1% largest values, and the scalars I am sorting on are IID (which they are), then asymptotically the probability of inserting a new element is 0.1%. It does not matter if it is a heap, a sorted vector, or whatever, as long as I can quickly decide whether I need any insertion.

:person_facepalming:

Worrying about the best implementation is quite silly for my use case.