Sample without replacement… and without StatsBase or shuffle

Hi, is there a way to sample 3 ids from 1:10 without replacement, but also without using StatsBase (or any other package) and without shuffling/randperm-ing the whole array (which in my case could be relatively big)?

If you need to sample a few indices from a large pool of indices, then the following is super simple and should be fairly efficient:

# keep drawing k indices (with replacement) until they all happen to be distinct
function random_indices(n, k)
    while true
        ix = rand(1:n, k)
        allunique(ix) && return ix
    end
end

The probability of randomly picking $k$ unique elements out of $n$ is $P = \frac{n (n - 1) \cdots (n - k + 1)}{n^k} = \frac{n!}{n^k \, (n-k)!}$. The probability of performing at most $m$ iterations of the above algorithm before returning a value is $Q = 1 - (1 - P)^m$.

You can check for yourself what the probabilities are in your case:

P(n, k) = binomial(n, k) * (factorial(k) / (n^k))
Q(n, k, m) = 1 - (1 - P(n, k))^m

# prob. of getting 3 random indices out of 10 without replacement
# in at most two iterations of the above algorithm
julia> Q(10, 3, 2)
0.9216

You pay the cost of checking for uniqueness too, but again, if length(ix) is small…
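For example, a quick sanity check (the returned indices are random, so the exact values vary from run to run):

ix = random_indices(10, 3)
@assert length(ix) == 3 && allunique(ix) && all(in(1:10), ix)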

Edit: perhaps a better implementation of the same idea might be

function random_indices(n, k)
    ix = sizehint!(Set{Int}(), k)
    while length(ix) < k
        push!(ix, rand(1:n))
    end
    return collect(ix)
end

A few benchmarks. The first implementation of random_indices freezes my PC as soon as the ratio between sampled and total indices goes above ~1/5; the second one works well.

julia> using Random, BenchmarkTools, StatsBase
julia> function sample_whr(data,n;rng=Random.GLOBAL_RNG)
           sampled = Array{eltype(data),1}(undef,n)
           datarem = Set(data)
           for i in 1:n
               s = rand(rng, datarem)
               sampled[i] = s
               setdiff!(datarem,Set(s))
           end
           return sampled
       end;
julia> function random_indices(n, k)
           while true
              ix = rand(1:n, k)
              allunique(ix) && return ix
           end
       end;
julia> function random_indices2(n, k)
           ix = sizehint!(Set{Int}(), k)
           while length(ix) < k
               push!(ix, rand(1:n))
           end
           return collect(ix)
       end;

julia> n    = 100000;
julia> ns   = 100;
julia> data = 1:n;
julia> @btime random_indices(n,ns);
  3.005 μs (11 allocations: 4.50 KiB)
julia> @btime random_indices2(n,ns);
  3.220 μs (8 allocations: 3.60 KiB)
julia> @btime sample_whr(data,ns);
  1.531 ms (408 allocations: 2.29 MiB)
julia> @btime randperm(n)[1:ns];
  779.186 μs (4 allocations: 782.20 KiB)
julia> @btime shuffle(data)[1:ns];
  1.017 ms (4 allocations: 782.20 KiB)
julia> @btime sample(1:n, ns; replace=false);
  2.995 μs (9 allocations: 3.63 KiB)

julia> n    = 1000;
julia> ns   = 150; # much more than this and random_indices locks up my PC
julia> data = 1:n;
julia> @btime random_indices(n,ns);
  12.827 ms (39425 allocations: 14.53 MiB)
julia> @btime random_indices2(n,ns);
  4.838 μs (8 allocations: 4.05 KiB)
julia> @btime sample_whr(data,ns);
  44.517 μs (608 allocations: 78.51 KiB)
julia> @btime randperm(n)[1:ns];
  7.077 μs (3 allocations: 9.30 KiB)
julia> @btime shuffle(data)[1:ns];
  7.651 μs (3 allocations: 9.30 KiB)
julia> @btime sample(1:n, ns; replace=false);
  1.791 μs (3 allocations: 9.30 KiB)

julia> n    = 1000;
julia> ns   = 800;
julia> data = 1:n;
       #@btime random_indices(n,ns);
julia> @btime random_indices2(n,ns);
  43.442 μs (8 allocations: 24.96 KiB)
julia> @btime sample_whr(data,ns);
  165.601 μs (3208 allocations: 337.46 KiB)
julia> @btime randperm(n)[1:ns];
  8.062 μs (3 allocations: 14.34 KiB)
julia> @btime shuffle(data)[1:ns];
  8.096 μs (3 allocations: 14.34 KiB)
julia> @btime sample(1:n, ns; replace=false);
   5.924 μs (3 allocations: 14.34 KiB)

(EDIT: added StatsBase.sample to the benchmarks, Julia 1.10)

Yeah, the first implementation allocates unnecessarily. Nevertheless, this approach is not suitable if ns is close to n (I am referring to the variables in your benchmarks). In that case you should invest the time to implement a more complex but smarter algorithm (or make peace with StatsBase :slight_smile:).
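One concrete option (just a sketch; the name partial_fisher_yates is only illustrative, not what anyone above proposed) is a partial Fisher–Yates shuffle: build the index vector once and shuffle only its first k positions. It still allocates O(n), but it does exactly k swaps and works for any k ≤ n, including k close to n:

# partial Fisher–Yates shuffle: only the first k positions of 1:n are shuffled,
# so the cost on top of building the index vector is O(k) swaps
function partial_fisher_yates(n, k)
    ix = collect(1:n)
    for i in 1:k
        j = rand(i:n)            # pick a random position from the not-yet-fixed tail
        ix[i], ix[j] = ix[j], ix[i]
    end
    return ix[1:k]
end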

I was thinking there might be a way to do it by bootstrapping off of Random.randsubseq, which does Bernoulli sampling (and is included in the Random stdlib because it is used for sprand), but it’s not obvious to me how to efficiently get exactly k elements with the right statistics.
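One inefficient but statistically valid way (a sketch; randsubseq_exact is just an illustrative name) is rejection: Bernoulli-sample with inclusion probability k/n and retry until exactly k elements come back. Conditioned on the size being exactly k, the subset is uniform over all k-subsets, but the expected number of retries grows roughly like sqrt(k), so this is more a curiosity than a fix:

using Random

function randsubseq_exact(n, k)
    while true
        ix = randsubseq(1:n, k / n)   # each index kept independently with prob. k/n
        length(ix) == k && return ix  # accept only when exactly k survive
    end
end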

Sounds like a problem for Reservoir Sampling. A very naive approach is to generate n uniform random numbers in [0, 1) (e.g. rand(n)), take the sort permutation, and return the first k of the sorted indices:

function random_indices(n, k)
    xs = rand(n)
    p = sortperm(xs)
    return p[begin:k]
end

This is probably less efficient than the reservoir sampling method, so take a look at the wiki.
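As a side note (and only as a sketch; random_indices_partial is just an illustrative name), the full sort above can be avoided with Base's partialsortperm, which only orders the k smallest keys; this is essentially the partialrandperm wished for later in the thread:

function random_indices_partial(n, k)
    xs = rand(n)                      # one uniform key per index
    return partialsortperm(xs, 1:k)   # indices of the k smallest keys
end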

Edit: Here is my implementation of the optimal Algorithm L:

# samples k=length(dst) items from src without replacement
# stores the result in dst
function reservoir_sample!(dst, src)
    n = length(src)
    k = length(dst)
    i = 1

    # fill the reservoir with the first k items
    while i <= k
        dst[i] = src[i]
        i += 1
    end

    W = exp(log(rand()) / k)

    # skip ahead by geometrically distributed gaps, replacing a random slot each time
    while i <= n
        i = i + floor(Int, log(rand()) / log(1 - W)) + 1
        if i <= n
            dst[rand(1:k)] = src[i]
            W = W * exp(log(rand()) / k)
        end
    end

    return dst
end
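A quick usage sketch (the length of the destination determines k; the output is random, so it varies):

dst = Vector{Int}(undef, 3)
reservoir_sample!(dst, 1:10)
@assert allunique(dst) && all(in(1:10), dst)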

Realize that a whole bunch of these algorithms are already implemented in StatsBase.jl.

I’m not entirely sure why the OP doesn’t want to use StatsBase.


If you actually care about the k = 3 case, this should be decently fast:

function sample3(n)
    i1 = rand(1:n)
    i2 = rand(1:(n - 1))
    i3 = rand(1:(n - 2))
    i2 += i2 >= i1           # skip over i1
    i3 += i3 >= min(i1, i2)  # skip over the smaller of i1, i2 ...
    i3 += i3 >= max(i1, i2)  # ... and then over the larger one
    return i1, i2, i3
end
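A quick smoke test (only a sanity check that the three indices are always distinct, not a proof of uniformity):

@assert all(allunique(sample3(10)) for _ in 1:100_000)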

I am aware; that’s how I came across reservoir sampling in the first place :slight_smile: I needed to implement (weighted) random sampling in a C# library and took inspiration from StatsBase.

For OP, I think reservoir sampling is small and easy enough to implement without taking on the extra dependency.

Also consider that this implementation of reservoir_sample! is not really correct: it doesn’t ensure that each ordering of the sample has the same probability (e.g. consider what happens when n = k in your implementation). You should shuffle the output.

Depends on what kind of sampling you want—sometimes you want to draw samples sorted in the order they appear in the underlying array.

Sometimes the order doesn’t matter, but what I wanted to point out is that here the result is neither ordered nor random: e.g. for k = n - 1, the last value of the array can end up in any position if it gets into the sample.

I was just hoping that, just as Base/the STL gives us sortperm, partialsortperm and randperm, we would also have a partialrandperm, but yes, I think using sample from StatsBase is the best way…