I am iterating through about 10^6 values x and want to keep n of those for which f(x) is the largest.n is around 1000.
I have looked at similar topics (1, 2, 3), some suggest maintaining a vector (see below), some min-max heaps from DataStructures.jl, but I found that the implementation does not support a comparison function.
For reference, here is a trivial implementation of using vectors:
struct KeepLargest{T,F}
max_count::Int
largest::Vector{T}
by::F
function KeepLargest{T}(max_count, by::F = identity) where {T,F}
new{T,F}(max_count, Vector{T}(), by)
end
end
function record!(kl::KeepLargest{T}, x::T) where T
# note: we simply maintain the invariant issorted(kl.largest; kl.by)
(; max_count, largest, by) = kl
i = searchsortedlast(largest, by(x); by = first) # new position
if length(largest) < max_count
insert!(largest, i + 1, x)
elseif i > 0
for j in 1:(i-1)
largest[j] = largest[j + 1]
end
largest[i] = x
end
end
An MWE with test:
kl = KeepLargest{Pair{Float64,Float64}}(10, first)
y = Vector{Pair{Float64,Float64}}()
for _ in 1:1000
r = randn()
record!(kl, r^2 => r)
push!(y, r^2 => r)
@assert issorted(kl.largest; by = kl.by)
end
sort!(y; by = kl.by)
@assert y[(end-10+1):end] == kl.largest
I have two questions:
-
is there anything more efficient in a package somewhere?
-
if not, is there a package which has more or less the above implementation? (It feels kind of silly to include it as a utility in a totally unrelated package, with tests and all.)