I’m in need of a fast method that uniformly subsamples the elements of a given array until only a specified number of unique elements is left. For example input
target_unique_el=2, then the (random) result could be
res=[4,2,2] (as it has two unique elements left).
My current implementation is:
using Random using BenchmarkTools function subsample_unique!(array, target_unique_el) shuffle!(array) # for uniform sampling unique_el = length(Set(array)) while unique_el>target_unique_el pop_el = pop!(array) if pop_el ∉ array unique_el -= 1 end end end
which performs as
const array = rand(1:300, 1000) @btime subsample_unique!(run, 100) setup=(run=deepcopy(array)) evals=1 # 97.754 μs (7 allocations: 18.66 KiB)
The few allocations come from the
Set() creation and I think most of time is currently spent in the look-up at
pop_el ∉ array.
subsample_unique!(array, 100) length(array) # 134 length(unique(array)) # 100
If possible, I want to improve the performance. Glad for any tips/hints!