Hey!
I’m in need of a fast method that uniformly subsamples the elements of a given array until only a specified number of unique elements is left. For example, for input array=[4,1,2,2,4] and target_unique_el=2, a (random) result could be res=[4,2,2], as it has two unique elements left.
My current implementation is:
using Random
using BenchmarkTools

function subsample_unique!(array, target_unique_el)
    shuffle!(array) # for uniform sampling
    unique_el = length(Set(array))
    while unique_el > target_unique_el
        pop_el = pop!(array)
        # ∉ scans the remaining array, so this membership test is O(n) per pop
        if pop_el ∉ array
            unique_el -= 1
        end
    end
    return array
end
Thanks! Using a dict for the look-up might be a bit faster (~3 times), though it creates a few more allocations.
using StatsBase
using Random
using BenchmarkTools

function subsample_unique_2!(array, target_unique_el)
    shuffle!(array) # for uniform sampling
    d = countmap(array) # element => multiplicity
    unique_el = length(d) # number of distinct elements
    while unique_el > target_unique_el
        pop_el = pop!(array)
        d[pop_el] -= 1
        if d[pop_el] == 0
            unique_el -= 1
        end
    end
    return array
end
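For reference, the two variants can be compared along these lines (a sketch; exact timings depend on the machine):

array = rand(1:300, 1000)
@btime subsample_unique!(a, 100) setup=(a=deepcopy(array)) evals=1
@btime subsample_unique_2!(a, 100) setup=(a=deepcopy(array)) evals=1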
julia> function subsample_unique!(array, target_unique_el)
           shuffle!(array) # for uniform sampling
           unique_el = length(Set(array))
           while unique_el > target_unique_el
               pop_el = pop!(array)
               if pop_el ∉ array
                   unique_el -= 1
               end
           end
       end
subsample_unique! (generic function with 1 method)
julia> function subsample_unique2(array, target_unique_el)
           s = BitSet()
           while length(s) < target_unique_el
               push!(s, rand(array))
           end
           return collect(s)
       end
subsample_unique2 (generic function with 1 method)
julia> function subsample_unique3(array, target_unique_el)
           s = Set()
           while length(s) < target_unique_el
               push!(s, rand(array))
           end
           return collect(s)
       end
subsample_unique3 (generic function with 1 method)
julia> using BenchmarkTools
julia> array = rand(1:300, 1000);
julia> using Random
julia> @btime subsample_unique!(run, 100) setup=(run=deepcopy(array)) evals=1;
123.600 μs (7 allocations: 18.66 KiB)
julia> @btime subsample_unique2(run, 100) setup=(run=deepcopy(array)) evals=1;
2.000 μs (5 allocations: 1.11 KiB)
julia> @btime subsample_unique3(run, 100) setup=(run=deepcopy(array)) evals=1;
7.900 μs (11 allocations: 4.75 KiB)
(using BitSet is faster, but Set is more general, as BitSet only stores integers)
Also, I did not add a check that the result is achievable, so the procedure assumes you are not requesting a number of unique values that is very large compared to the true number of unique values. (In that case you can shuffle! your array and then add elements sequentially to the Set or BitSet.)
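For completeness, a minimal sketch of such an up-front check could look like this (check_achievable is a hypothetical helper, not from this thread):

function check_achievable(array, target_unique_el)
    n_unique = length(Set(array)) # true number of distinct values
    n_unique >= target_unique_el ||
        throw(ArgumentError("only $n_unique unique values available, cannot reach $target_unique_el"))
end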
Thanks! These are fast! But I have to say they are a bit different: your results would only contain the unique elements, with no duplicates left. I have to think about it, but that might be fine for my purposes…
Ah - OK. I have misunderstood the question. Then do the following:
function subsample_unique!(array, target_unique_el)
    target_unique_el == 0 && return empty!(array)
    shuffle!(array) # for uniform sampling
    tmp = BitSet()
    i = 0
    for v in array
        i += 1
        push!(tmp, v)
        # stop once the target number of unique values has been seen
        length(tmp) < target_unique_el || break
    end
    length(tmp) != target_unique_el && error("failed to sample $target_unique_el unique values")
    return resize!(array, i) # keep the prefix, duplicates included
end
(not tested, so there might be bugs, but I hope the idea is clear)
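Since it is untested, a quick smoke test on the example from the original post might look like this (my sketch; the exact surviving elements are random):

arr = [4, 1, 2, 2, 4]
subsample_unique!(arr, 2)
@assert length(Set(arr)) == 2 # two unique values kept, duplicates retained
subsample_unique!(arr, 0)
@assert isempty(arr) # target of zero empties the array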
Thanks, both of you! Your last two versions seem on par for the test array here. For my ‘real’ array the last version is much faster in my hands (maybe it has something to do with the maximum value being around 2_000_000). Really helpful altogether!
EDIT: If I replace BitSet() with Set() in @bkamins’s version, the performance is also almost the same for my ‘real’ array.
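That swap is just one line in the function body (sketch; a typed Set{eltype(array)} would avoid the abstract Set{Any}):

tmp = Set{eltype(array)}() # instead of tmp = BitSet()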
You can probably do it faster by customizing the shuffle!: basically roll your own and keep count of the number of uniques encountered so far, then stop shuffling as soon as the target number has been reached.
Thank you both, these are good points! Currently I’m fine with this, as now another method is (again) the main runtime bottleneck of my code. It was great to get this into the low µs range!
This should be blazing fast! The main idea is to shuffle only the required indices of the array as you go, keeping track of the number of unique elements. Once the target number is reached, you stop the loop and resize the array accordingly, hence zero allocations. This is basically the idea @xiaodai gave in his last comment.
function subsample_unique3!(array, target_unique_el)
    n = length(array)
    i = 0
    count = 0
    while count < target_unique_el
        i += 1
        # shuffle the current element (one step of a Fisher-Yates shuffle)
        r = rand(i:n)
        array[i], array[r] = array[r], array[i]
        ai = array[i]
        # linear scan over the already-shuffled prefix to test whether ai is new
        unique_i = true
        for j = 1:i-1
            if array[j] == ai
                unique_i = false
                break
            end
        end
        count += unique_i
    end
    resize!(array, i)
end
array = rand(1:2_000_000, 1000);
aa = deepcopy(array);
b = subsample_unique3!(aa, 100);
@assert length(unique(b)) == 100
@btime subsample_unique3!(run, 100) setup=(run=deepcopy(array)) evals=1
3.100 μs (0 allocations: 0 bytes)
Nice! For the test array here it is indeed faster (3.867 μs (0 allocations: 0 bytes) vs. 6.300 μs (2 allocations: 160 bytes)); for my real array, however, it is not (47.610 μs (0 allocations: 0 bytes) vs. 24.802 μs (3 allocations: 228.05 KiB)). The reason might be that the two nested loops are inefficient compared to one loop plus a lookup array (with some allocations)?
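For what it’s worth, a sketch of that “one loop plus lookup array” idea (hypothetical, not from the thread; it assumes positive integer values up to maxval, and the lookup vector is the only allocation):

function subsample_unique4!(array, target_unique_el, maxval=maximum(array))
    n = length(array)
    seen = falses(maxval) # one allocation; O(1) membership test per element
    i = 0
    count = 0
    while count < target_unique_el
        i += 1
        # one step of a Fisher-Yates shuffle, as in subsample_unique3!
        r = rand(i:n)
        array[i], array[r] = array[r], array[i]
        v = array[i]
        if !seen[v]
            seen[v] = true
            count += 1
        end
    end
    return resize!(array, i)
end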