Sub-sample array until a target number of unique elements is left

Hey!
I’m in need of a fast method that uniformly subsamples the elements of a given array until only a specified number of unique elements is left. For example, for the input array=[4,1,2,2,4] and target_unique_el=2, a (random) result could be res=[4,2,2], since it has two unique elements left.
My current implementation is:

using Random
using BenchmarkTools

function subsample_unique!(array, target_unique_el)
    shuffle!(array) # for uniform sampling
    unique_el = length(Set(array))
    while unique_el>target_unique_el
        pop_el = pop!(array)
        if pop_el ∉ array
            unique_el -= 1
        end
    end
end

which performs as follows:

const array = rand(1:300, 1000)
@btime subsample_unique!(run, 100) setup=(run=deepcopy(array)) evals=1
# 97.754 μs (7 allocations: 18.66 KiB)

The few allocations come from the Set() creation, and I think most of the time is currently spent in the look-up pop_el ∉ array.
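One hedged way to check that assumption is to time a single membership test on comparable data (the variable name here is just for illustration; the numbers will differ per machine):

using BenchmarkTools

a = rand(1:300, 1000)
@btime $(a[end]) ∉ $a  # one linear scan over up to 1000 elements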

Example results:

subsample_unique!(array, 100)
length(array) # 134
length(unique(array)) # 100

If possible, I want to improve the performance. Glad for any tips/hints!

Do you know the min and max possible values of your elements? Are they small?

I know them, 1 and about 2_000_000.

Thanks! Using a dict for the look-up turns out to be a bit faster (~3 times), though it creates a few more allocations.

using StatsBase
using Random
using BenchmarkTools

function subsample_unique_2!(array, target_unique_el)
    shuffle!(array)
    d = countmap(array)
    unique_el = length(d)
    while unique_el>target_unique_el
        pop_el = pop!(array)
        d[pop_el] -= 1
        if d[pop_el]==0
            unique_el -= 1
        end
    end
end

benchmark:

const array = rand(1:300, 1000)
@btime subsample_unique_2!(run, 100) setup=(run=deepcopy(array)) evals=1
# 30.996 μs (15 allocations: 34.41 KiB)

Here are some natural alternatives:

julia> function subsample_unique!(array, target_unique_el)
           shuffle!(array) # for uniform sampling
           unique_el = length(Set(array))
           while unique_el>target_unique_el
               pop_el = pop!(array)
               if pop_el ∉ array
                   unique_el -= 1
               end
           end
       end
subsample_unique! (generic function with 1 method)

julia> function subsample_unique2(array, target_unique_el)
           s = BitSet()
           while length(s) < target_unique_el
               push!(s, rand(array))
           end
           return collect(s)
       end
subsample_unique2 (generic function with 1 method)

julia> function subsample_unique3(array, target_unique_el)
           s = Set()
           while length(s) < target_unique_el
               push!(s, rand(array))
           end
           return collect(s)
       end
subsample_unique3 (generic function with 1 method)

julia> using BenchmarkTools

julia> array = rand(1:300, 1000);

julia> using Random

julia> @btime subsample_unique!(run, 100) setup=(run=deepcopy(array)) evals=1;
  123.600 μs (7 allocations: 18.66 KiB)

julia> @btime subsample_unique2(run, 100) setup=(run=deepcopy(array)) evals=1;
  2.000 μs (5 allocations: 1.11 KiB)

julia> @btime subsample_unique3(run, 100) setup=(run=deepcopy(array)) evals=1;
  7.900 μs (11 allocations: 4.75 KiB)

(using BitSet is faster, but Set is more general)

Also, I did not add a check that the result is achievable, so the procedure assumes you are not requesting a number of unique values that is very large compared to the true number of unique values. If that is a concern, you can shuffle! your array and then add its elements sequentially to the Set or BitSet.

Thanks! These are fast :smiley: But I have to say they are a bit different: your results would only contain the unique elements and no duplicates anymore. I have to think about it, but that might be fine for my purposes…
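To make the difference concrete, a toy example (assuming the functions defined above are in scope; the outputs are random):

a = [4, 1, 2, 2, 4]
subsample_unique!(a, 2)                # mutates a in place; it might end up as e.g. [4, 2, 2] (duplicates kept)
subsample_unique2([4, 1, 2, 2, 4], 2)  # returns e.g. [2, 4] (only the unique values, duplicates dropped)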

operating on an array like this is expensive. better to do what @bkamins suggests and build a new one

good point

Ah - OK. I have misunderstood the question. Then do the following:

function subsample_unique!(array, target_unique_el)
    target_unique_el == 0 && return empty!(array)
    shuffle!(array) # for uniform sampling
    tmp = BitSet()
    i = 0
    for v in array
        i += 1
        push!(tmp, v)
        length(tmp) < target_unique_el || break
    end
    length(tmp) != target_unique_el && error("failed to sample $target_unique_el unique values")
    return resize!(array, i)
end

(not tested, so there might be bugs, but I hope the idea is clear)
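A quick, hedged way to sanity-check it (assuming the corrected function above is defined; the kept length is random, but the unique count should be exact):

arr = rand(1:300, 1000)
subsample_unique!(arr, 100)
@assert length(unique(arr)) == 100  # exactly the requested number of unique values
length(arr)                         # random, at least 100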

This is faster:

const array = rand(1:300, 1000);

using Random: shuffle!

function subsample_unique4!(array, target_unique_el)
    shuffle!(array)

    maxval = maximum(array)

    has_array = falses(maxval)  # marks which values have been seen; assumes positive integer values

    unique_count = 0
    @inbounds for (i, a) in enumerate(array)
        if !has_array[a]
            has_array[a] = true
            unique_count += 1
            if unique_count > target_unique_el
                return resize!(array, i-1)
            end
        end
    end
end

aa = deepcopy(array)
b = subsample_unique4!(aa, 100);

@assert length(unique(b)) == 100

@benchmark subsample_unique4!(run, 100) setup=(run=deepcopy(array)) evals=1

BenchmarkTools.Trial: 
  memory estimate:  160 bytes
  allocs estimate:  2
  --------------
  minimum time:     6.400 μs (0.00% GC)
  median time:      7.600 μs (0.00% GC)
  mean time:        9.015 μs (6.56% GC)
  maximum time:     5.941 ms (99.54% GC)
  --------------
  samples:          10000
  evals/sample:     1

Thanks both of you! Your last two versions seem on par for the test array here. For my ‘real’ array the last version is much faster in my hands (maybe has something to do with the maximum value around 2_000_000). Really helpful altogether!

EDIT: If I replace BitSet() with Set() in @bkamins’ version, the performance is almost the same for my ‘real’ array, too.
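For reference, a hedged sketch of that substitution (same logic as the function above, only the container changes; the typed Set and the name subsample_unique_set! are my own choices here):

using Random: shuffle!

function subsample_unique_set!(array, target_unique_el)
    target_unique_el == 0 && return empty!(array)
    shuffle!(array) # for uniform sampling
    tmp = Set{eltype(array)}()  # works for any element type, not only (small) positive integers
    i = 0
    for v in array
        i += 1
        push!(tmp, v)
        length(tmp) < target_unique_el || break
    end
    length(tmp) != target_unique_el && error("failed to sample $target_unique_el unique values")
    return resize!(array, i)
end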

Note that the implementation by @xiaodai assumes that you have only positive entries in your data.

You can probably do it faster by customizing shuffle!: basically roll your own, keep count of the number of uniques encountered so far, and stop shuffling as soon as the target number has been exceeded.

Thank you both, these are good points! Currently I’m fine with this, as another method is now (again) the main run-time bottleneck of my code. It was great to get this into the low µs range!

This should be blazing fast! The main idea is to shuffle only the required indices of the array as you go and keep track of the number of unique elements. Once the target number is reached, you stop the loop and resize the array accordingly, hence zero allocations. This is basically the idea @xiaodai gave in his last comment.

function subsample_unique3!(array, target_unique_el)
    n = length(array)
    i = 0
    count = 0
    while count < target_unique_el
        i += 1
        
        # shuffle current element
        r = rand(i:n)
        array[i], array[r] = array[r], array[i]
        
        ai = array[i]
        unique_i = true
        for j = 1:i-1 
            if array[j] == ai
                unique_i = false  
                break 
            end 
        end
        count += unique_i 
    end
    resize!(array,i)
end

array = rand(1:2_000_000, 1000);
aa = deepcopy(array);
b = subsample_unique3!(aa, 100);
@assert length(unique(b)) == 100

@btime subsample_unique3!(run, 100) setup=(run=deepcopy(array)) evals=1
  3.100 μs (0 allocations: 0 bytes)

you need to exceed it then scale back

if that’s not the case, you can probably use an offset arrays package to handle it

Yes. Or manually calculate the offset.
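A hedged sketch of the manual-offset idea (the name subsample_unique_offset! is invented here; it also covers negative or large-minimum integer values):

using Random: shuffle!

function subsample_unique_offset!(array, target_unique_el)
    shuffle!(array) # for uniform sampling
    minval, maxval = extrema(array)
    seen = falses(maxval - minval + 1)  # index 1 corresponds to minval
    unique_count = 0
    @inbounds for (i, a) in enumerate(array)
        idx = a - minval + 1
        if !seen[idx]
            seen[idx] = true
            unique_count += 1
            unique_count > target_unique_el && return resize!(array, i - 1)
        end
    end
    return array  # never exceeded the target: keep everything
end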

Nice! For the test array here it is indeed faster (3.867 μs (0 allocations: 0 bytes) vs. 6.300 μs (2 allocations: 160 bytes)); for my real array, however, it is not (47.610 μs (0 allocations: 0 bytes) vs. 24.802 μs (3 allocations: 228.05 KiB)). The reason might be that the nested loops (a linear scan for every new element) are less efficient than a single loop plus a lookup array, even at the cost of some allocations.
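For what it’s worth, a hedged sketch of such a combination (the partial Fisher–Yates shuffle from the last version, but with a set look-up instead of the inner scan; the name subsample_unique5! is invented here):

function subsample_unique5!(array, target_unique_el)
    n = length(array)
    seen = BitSet()  # or Set{eltype(array)}() for non-integer or very sparse data
    i = 0
    while length(seen) < target_unique_el
        i += 1
        i > n && error("not enough unique values in array")
        r = rand(i:n)  # shuffle only the element we are about to inspect
        array[i], array[r] = array[r], array[i]
        push!(seen, array[i])
    end
    return resize!(array, i)
end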