# Sub-sample array until a target number of unique elements is left

Hey!
I’m in need of a fast method that uniformly subsamples the elements of a given array until only a specified number of unique elements is left. For example, for the input array=[4,1,2,2,4] and target_unique_el=2, a (random) result could be res=[4,2,2], as it has two unique elements left.
My current implementation is:

```julia
using Random
using BenchmarkTools

function subsample_unique!(array, target_unique_el)
    shuffle!(array) # for uniform sampling
    unique_el = length(Set(array))
    while unique_el > target_unique_el
        pop_el = pop!(array)
        if pop_el ∉ array
            unique_el -= 1
        end
    end
end
```

which performs as

```julia
const array = rand(1:300, 1000)
@btime subsample_unique!(run, 100) setup=(run=deepcopy(array)) evals=1
# 97.754 μs (7 allocations: 18.66 KiB)
```

The few allocations come from the Set() creation, and I think most of the time is currently spent in the look-up at pop_el ∉ array.

Example results:

```julia
subsample_unique!(array, 100)
length(array)         # 134
length(unique(array)) # 100
```

If possible, I want to improve the performance. Glad for any tips/hints!


Do you know the min and max possible values of your elements? Are they small?

I know them: 1 and about 2_000_000.

Thanks! Using a dict for the look-up is a bit faster (about 3×), though it creates a few more allocations.

```julia
using StatsBase
using Random
using BenchmarkTools

function subsample_unique_2!(array, target_unique_el)
    shuffle!(array)
    d = countmap(array)
    unique_el = length(d) # number of distinct values
    while unique_el > target_unique_el
        pop_el = pop!(array)
        d[pop_el] -= 1
        if d[pop_el] == 0
            unique_el -= 1
        end
    end
end
```

benchmark:

```julia
const array = rand(1:300, 1000)
@btime subsample_unique_2!(run, 100) setup=(run=deepcopy(array)) evals=1
# 30.996 μs (15 allocations: 34.41 KiB)
```

Here are some natural alternatives:

```julia
julia> function subsample_unique!(array, target_unique_el)
           shuffle!(array) # for uniform sampling
           unique_el = length(Set(array))
           while unique_el > target_unique_el
               pop_el = pop!(array)
               if pop_el ∉ array
                   unique_el -= 1
               end
           end
       end
subsample_unique! (generic function with 1 method)

julia> function subsample_unique2(array, target_unique_el)
           s = BitSet()
           while length(s) < target_unique_el
               push!(s, rand(array))
           end
           return collect(s)
       end
subsample_unique2 (generic function with 1 method)

julia> function subsample_unique3(array, target_unique_el)
           s = Set()
           while length(s) < target_unique_el
               push!(s, rand(array))
           end
           return collect(s)
       end
subsample_unique3 (generic function with 1 method)

julia> using BenchmarkTools

julia> array = rand(1:300, 1000);

julia> using Random

julia> @btime subsample_unique!(run, 100) setup=(run=deepcopy(array)) evals=1;
  123.600 μs (7 allocations: 18.66 KiB)

julia> @btime subsample_unique2(run, 100) setup=(run=deepcopy(array)) evals=1;
  2.000 μs (5 allocations: 1.11 KiB)

julia> @btime subsample_unique3(run, 100) setup=(run=deepcopy(array)) evals=1;
  7.900 μs (11 allocations: 4.75 KiB)
```

(using BitSet is faster, but Set is more general)

Also, I did not add a check that the result is achievable (the procedure assumes you are not requesting a number of unique values that is very large compared to the true number of unique values; in that case you can shuffle! your array and then add elements sequentially to the Set or BitSet).
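For completeness, one way to combine such a feasibility check with the sequential-add fallback might look like this (a sketch of my own; the function name `checked_subsample` is not from the thread):

```julia
using Random

# Sketch: fail early if the array cannot supply enough unique values, then
# shuffle and add elements sequentially. The sequential pass always
# terminates, unlike repeated rand(array) draws when the target is close to
# the true number of uniques.
function checked_subsample(array, target_unique_el)
    length(Set(array)) >= target_unique_el ||
        error("array has fewer than $target_unique_el unique values")
    shuffle!(array)                 # walk the array in random order
    s = BitSet()
    for v in array
        push!(s, v)
        length(s) == target_unique_el && break
    end
    return collect(s)
end
```

For example, `checked_subsample(rand(1:300, 1000), 100)` returns 100 unique values, while a target of 301 errors immediately instead of spinning forever.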


Thanks! These are fast. But I have to say they are a bit different: your results would only contain the unique elements and no duplicates anymore. I have to think about it, but that might be fine for my purposes…

Operating on an array like this is expensive. Better do what @bkamins suggests and build a new one.

good point

Ah - OK. I have misunderstood the question. Then do the following:

```julia
function subsample_unique!(array, target_unique_el)
    target_unique_el == 0 && return empty!(array)
    shuffle!(array) # for uniform sampling
    tmp = BitSet()
    i = 0
    for v in array
        i += 1
        push!(tmp, v)
        length(tmp) < target_unique_el || break
    end
    length(tmp) != target_unique_el && error("failed to sample $target_unique_el unique values")
    return resize!(array, i)
end
```

(not tested, so there might be bugs, but I hope the idea is clear)


this is faster

```julia
using Random: shuffle!
using BenchmarkTools

const array = rand(1:300, 1000);

function subsample_unique4!(array, target_unique_el)
    shuffle!(array)

    maxval = maximum(array)

    # lookup vector: has_array[a] == true once value a has been seen
    has_array = falses(maxval)

    unique_count = 0
    @inbounds for (i, a) in enumerate(array)
        if !has_array[a]
            has_array[a] = true
            unique_count += 1
            if unique_count > target_unique_el
                return resize!(array, i - 1)
            end
        end
    end
end

aa = deepcopy(array)
b = subsample_unique4!(aa, 100);

@assert length(unique(b)) == 100

@benchmark subsample_unique4!(run, 100) setup=(run=deepcopy(array)) evals=1
```

```
BenchmarkTools.Trial:
  memory estimate:  160 bytes
  allocs estimate:  2
  --------------
  minimum time:     6.400 μs (0.00% GC)
  median time:      7.600 μs (0.00% GC)
  mean time:        9.015 μs (6.56% GC)
  maximum time:     5.941 ms (99.54% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

Thanks, both of you! Your last two versions seem on par for the test array here. For my ‘real’ array, the last version is much faster in my hands (maybe that has something to do with the maximum value being around 2_000_000). Really helpful altogether!

EDIT: If I replace BitSet() with Set() in @bkamins’s version, it also gives almost the same performance for my ‘real’ array.


Note that the implementation by @xiaodai assumes that you have only positive entries in your data.


You can probably do it faster by rolling your own shuffle!: keep count of the number of uniques encountered so far and stop shuffling as soon as the target number has been exceeded.


Thank you both, these are good points! Currently I’m fine with this, as now another method is (again) the main run-time bottleneck of my code. It was great to get this into the low µs range!


This should be blazing fast! The main idea is to shuffle only the required indices of the array as you go and keep track of the number of unique elements. Once the target number is reached, you stop the loop and resize the array accordingly, hence zero allocations. This is basically the idea @xiaodai gave in his last comment.

```julia
function subsample_unique3!(array, target_unique_el)
    n = length(array)
    i = 0
    count = 0
    while count < target_unique_el
        i += 1

        # shuffle current element
        r = rand(i:n)
        array[i], array[r] = array[r], array[i]

        # check whether array[i] is new by scanning the prefix
        ai = array[i]
        unique_i = true
        for j = 1:i-1
            if array[j] == ai
                unique_i = false
                break
            end
        end
        count += unique_i
    end
    resize!(array, i)
end
```

```julia
array = rand(1:2_000_000, 1000);
aa = deepcopy(array);
b = subsample_unique3!(aa, 100);
@assert length(unique(b)) == 100

@btime subsample_unique3!(run, 100) setup=(run=deepcopy(array)) evals=1
# 3.100 μs (0 allocations: 0 bytes)
```

You need to exceed it and then scale back.

If that’s not the case, you can probably use an offset-arrays package to handle that case.

Yes. Or manually calculate the offset.
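A minimal sketch of the manual-offset idea, assuming the element range is cheap to compute (the function name and details here are my own, not from the thread):

```julia
using Random

# Sketch: same lookup-vector trick as subsample_unique4!, but shifted by the
# minimum value so zero and negative entries also work. Exceeds the target by
# one unique value, then scales back via resize!(array, i - 1).
function subsample_unique_offset!(array, target_unique_el)
    shuffle!(array)
    lo, hi = extrema(array)
    seen = falses(hi - lo + 1)      # index 1 corresponds to value lo
    unique_count = 0
    for (i, a) in enumerate(array)
        if !seen[a - lo + 1]
            seen[a - lo + 1] = true
            unique_count += 1
            unique_count > target_unique_el && return resize!(array, i - 1)
        end
    end
    return array  # fewer uniques than requested: array is left untruncated
end
```

Note the trade-off: the lookup vector is now sized by the value *range*, not the maximum, which can help or hurt depending on the data.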

Nice! For the test array here it is indeed faster (3.867 μs (0 allocations: 0 bytes) vs. 6.300 μs (2 allocations: 160 bytes)); for my real array, however, it is not (47.610 μs (0 allocations: 0 bytes) vs. 24.802 μs (3 allocations: 228.05 KiB)). The reason might be that the two nested loops are inefficient compared to one loop plus a lookup array (with some allocations)?