Randomly select x% of elements in a array/matrix

Let’s say I have a matrix like:

m = rand(50,100)

And now I want to select, say, 10% of the data in this matrix (a view will work, I don’t need a new array).

import Random
m[:, Random.randperm(100)[1:10]]

Is there a better way than this?

1 Like
Random.randsubseq(m, 0.1)

The StatsBase.jl package has other possible sampling methods, e.g. StatsBase.seqsample_a!, depending on what precisely you want to do.

2 Likes

Here is a simple way using Random’s most common function rand() - hope it is correct too:

M = rand(50,100)
p = 0.10
n = round(Int, p*length(M))
rand(M, n)

NB: this allows repetitions, not what is required.

This samples with replacement, but I think the OP wants without.

Edit: The source code for Random.randsubseq! is quite an interesting algorithm:

## randsubseq & randsubseq!

# Fill S (resized as needed) with a random subsequence of A, where
# each element of A is included in S with independent probability p.
# (Note that this is different from the problem of finding a random
#  size-m subset of A where m is fixed!)
function randsubseq!(r::AbstractRNG, S::AbstractArray, A::AbstractArray, p::Real)
    require_one_based_indexing(S, A)
    0 <= p <= 1 || throw(ArgumentError("probability $p not in [0,1]"))
    n = length(A)
    p == 1 && return copyto!(resize!(S, n), A)
    empty!(S)
    p == 0 && return S
    nexpected = p * length(A)
    sizehint!(S, round(Int,nexpected + 5*sqrt(nexpected)))
    if p > 0.15 # empirical threshold for trivial O(n) algorithm to be better
        for i = 1:n
            rand(r) <= p && push!(S, A[i])
        end
    else
        # Skip through A, in order, from each element i to the next element i+s
        # included in S. The probability that the next included element is
        # s==k (k > 0) is (1-p)^(k-1) * p, and hence the probability (CDF) that
        # s is in {1,...,k} is 1-(1-p)^k = F(k).   Thus, we can draw the skip s
        # from this probability distribution via the discrete inverse-transform
        # method: s = ceil(F^{-1}(u)) where u = rand(), which is simply
        # s = ceil(log(rand()) / log1p(-p)).
        # -log(rand()) is an exponential variate, so can use randexp().
        L = -1 / log1p(-p) # L > 0
        i = 0
        while true
            s = randexp(r) * L
            s >= n - i && return S # compare before ceil to avoid overflow
            push!(S, A[i += ceil(Int,s)])
        end
        # [This algorithm is similar in spirit to, but much simpler than,
        #  the one by Vitter for a related problem in "Faster methods for
        #  random sampling," Comm. ACM Magazine 7, 703-718 (1984).]
    end
    return S
end
1 Like

Yes, you are right, repetitions are allowed in the rand() example.

For distinct 10% values we could use sample():

using StatsBase
M = rand(50,100)
p = 0.10
n = round(Int, p*length(M))
sample(M, n; replace=false)

But the dedicated randsubseq() function suggested by Steve has simpler syntax.

sample (with replace=false), randsubseq, and rand all appear to do different things:

  • rand(A, n) returns exactly n elements sampled from A with replacement.
  • sample(A, n, replace=false) returns exactly n elements sampled from A without replacement.
  • randsubseq(A, p) returns a subset of A where each element is included with probability p. (Thus, the number of elements in the output is a binomial random variable with probability p: its length can be different every time.)
julia> using Random

julia> A = 1:10
1:10

julia> randsubseq(A, 0.1)
1-element Vector{Int64}:
 10

julia> randsubseq(A, 0.1)
Int64[]

julia> randsubseq(A, 0.1)
1-element Vector{Int64}:
 5

julia> randsubseq(A, 0.1)
2-element Vector{Int64}:
 7
 8

julia> randsubseq(A, 0.1)
1-element Vector{Int64}:
 3

I think sample(A, n, replace=false) is what the OP is after.

2 Likes

Yup, sample is what I was looking for. Thanks!
Only one thing - how do I get columns of the matrix instead of just values from the matrix?

You could use again sample together with view:

view(m, :, sample(1:100,10,replace=false))

Huh, why does this result in fewer allocations than sample alone? I expected that the call to sample inside of view would require allocating (defeating the purpose of the view), but @allocated reveals that this is not the case.

Not an explanation but an observation: the random columns selected are complete, so we may take views over them.

In my experiment, the difference was stark even for array input:

julia> using StatsBase

julia> a = rand(1000);

julia> @allocated sample(a, 50)
22087420

julia> @allocated view(a, sample(1:1000, 50; replace=false))
2263532

There might be some problem with your experiment. I get different results, confirmed also by using @btime and interpolating the variables:

a = rand(1000);
@btime sample($a, 50)                               # 298 ns (1 alloc: 496 bytes)
@btime view($a, sample(1:1000, 50; replace=false))  # 880 ns (2 allocs: 8.4 KiB)
1 Like