# Randomly select x% of elements in a array/matrix

Let’s say I have a matrix like:

``````m = rand(50,100)
``````

And now I want to select, say, 10% of the data in this matrix (a view will work, I don’t need a new array).

``````import Random
m[:, Random.randperm(100)[1:10]]
``````

Is there a better way than this?

1 Like
``````Random.randsubseq(m, 0.1)
``````

The StatsBase.jl package has other possible sampling methods, e.g. `StatsBase.seqsample_a!`, depending on what precisely you want to do.

2 Likes

Here is a simple way `using Random`’s most common function `rand()` - hope it is correct too:

``````M = rand(50,100)
p = 0.10
n = round(Int, p*length(M))
rand(M, n)
``````

NB: this allows repetitions, not what is required.

This samples with replacement, but I think the OP wants without.

Edit: The source code for `Random.randsubseq!` is quite an interesting algorithm:

``````## randsubseq & randsubseq!

# Fill S (resized as needed) with a random subsequence of A, where
# each element of A is included in S with independent probability p.
# (Note that this is different from the problem of finding a random
#  size-m subset of A where m is fixed!)
function randsubseq!(r::AbstractRNG, S::AbstractArray, A::AbstractArray, p::Real)
require_one_based_indexing(S, A)
0 <= p <= 1 || throw(ArgumentError("probability \$p not in [0,1]"))
n = length(A)
p == 1 && return copyto!(resize!(S, n), A)
empty!(S)
p == 0 && return S
nexpected = p * length(A)
sizehint!(S, round(Int,nexpected + 5*sqrt(nexpected)))
if p > 0.15 # empirical threshold for trivial O(n) algorithm to be better
for i = 1:n
rand(r) <= p && push!(S, A[i])
end
else
# Skip through A, in order, from each element i to the next element i+s
# included in S. The probability that the next included element is
# s==k (k > 0) is (1-p)^(k-1) * p, and hence the probability (CDF) that
# s is in {1,...,k} is 1-(1-p)^k = F(k).   Thus, we can draw the skip s
# from this probability distribution via the discrete inverse-transform
# method: s = ceil(F^{-1}(u)) where u = rand(), which is simply
# s = ceil(log(rand()) / log1p(-p)).
# -log(rand()) is an exponential variate, so can use randexp().
L = -1 / log1p(-p) # L > 0
i = 0
while true
s = randexp(r) * L
s >= n - i && return S # compare before ceil to avoid overflow
push!(S, A[i += ceil(Int,s)])
end
# [This algorithm is similar in spirit to, but much simpler than,
#  the one by Vitter for a related problem in "Faster methods for
#  random sampling," Comm. ACM Magazine 7, 703-718 (1984).]
end
return S
end
``````
1 Like

Yes, you are right, repetitions are allowed in the `rand()` example.

For distinct 10% values we could use `sample()`:

``````using StatsBase
M = rand(50,100)
p = 0.10
n = round(Int, p*length(M))
sample(M, n; replace=false)
``````

But the dedicated `randsubseq()` function suggested by Steve has simpler syntax.

`sample` (with `replace=false`), `randsubseq`, and `rand` all appear to do different things:

• `rand(A, n)` returns exactly `n` elements sampled from `A` with replacement.
• `sample(A, n, replace=false)` returns exactly `n` elements sampled from `A` without replacement.
• `randsubseq(A, p)` returns a subset of `A` where each element is included with probability `p`. (Thus, the number of elements in the output is a binomial random variable with probability `p`: its length can be different every time.)
``````julia> using Random

julia> A = 1:10
1:10

julia> randsubseq(A, 0.1)
1-element Vector{Int64}:
10

julia> randsubseq(A, 0.1)
Int64[]

julia> randsubseq(A, 0.1)
1-element Vector{Int64}:
5

julia> randsubseq(A, 0.1)
2-element Vector{Int64}:
7
8

julia> randsubseq(A, 0.1)
1-element Vector{Int64}:
3
``````

I think `sample(A, n, replace=false)` is what the OP is after.

2 Likes

Yup, `sample` is what I was looking for. Thanks!
Only one thing - how do I get columns of the matrix instead of just values from the matrix?

You could use again `sample` together with `view`:

``````view(m, :, sample(1:100,10,replace=false))
``````

Huh, why does this result in fewer allocations than `sample` alone? I expected that the call to `sample` inside of `view` would require allocating (defeating the purpose of the `view`), but `@allocated` reveals that this is not the case.

Not an explanation but an observation: the random columns selected are complete, so we may take views over them.

In my experiment, the difference was stark even for array input:

``````julia> using StatsBase

julia> a = rand(1000);

julia> @allocated sample(a, 50)
22087420

julia> @allocated view(a, sample(1:1000, 50; replace=false))
2263532
``````

There might be some problem with your experiment. I get different results, confirmed also by using `@btime` and interpolating the variables:

``````a = rand(1000);
@btime sample(\$a, 50)                               # 298 ns (1 alloc: 496 bytes)
@btime view(\$a, sample(1:1000, 50; replace=false))  # 880 ns (2 allocs: 8.4 KiB)
``````
1 Like