Let’s say I have a matrix like:
m = rand(50,100)
And now I want to select, say, 10% of the data in this matrix (a view will work, I don’t need a new array).
import Random
m[:, Random.randperm(100)[1:10]]
Is there a better way than this?
Let’s say I have a matrix like:
m = rand(50,100)
And now I want to select, say, 10% of the data in this matrix (a view will work, I don’t need a new array).
import Random
m[:, Random.randperm(100)[1:10]]
Is there a better way than this?
Random.randsubseq(m, 0.1)
The StatsBase.jl package has other possible sampling methods, e.g. StatsBase.seqsample_a!
, depending on what precisely you want to do.
Here is a simple way using Random
’s most common function rand()
- hope it is correct too:
M = rand(50,100)
p = 0.10
n = round(Int, p*length(M))
rand(M, n)
NB: this allows repetitions, not what is required.
This samples with replacement, but I think the OP wants without.
Edit: The source code for Random.randsubseq!
is quite an interesting algorithm:
## randsubseq & randsubseq!
# Fill S (resized as needed) with a random subsequence of A, where
# each element of A is included in S with independent probability p.
# (Note that this is different from the problem of finding a random
# size-m subset of A where m is fixed!)
function randsubseq!(r::AbstractRNG, S::AbstractArray, A::AbstractArray, p::Real)
require_one_based_indexing(S, A)
0 <= p <= 1 || throw(ArgumentError("probability $p not in [0,1]"))
n = length(A)
p == 1 && return copyto!(resize!(S, n), A)
empty!(S)
p == 0 && return S
nexpected = p * length(A)
sizehint!(S, round(Int,nexpected + 5*sqrt(nexpected)))
if p > 0.15 # empirical threshold for trivial O(n) algorithm to be better
for i = 1:n
rand(r) <= p && push!(S, A[i])
end
else
# Skip through A, in order, from each element i to the next element i+s
# included in S. The probability that the next included element is
# s==k (k > 0) is (1-p)^(k-1) * p, and hence the probability (CDF) that
# s is in {1,...,k} is 1-(1-p)^k = F(k). Thus, we can draw the skip s
# from this probability distribution via the discrete inverse-transform
# method: s = ceil(F^{-1}(u)) where u = rand(), which is simply
# s = ceil(log(rand()) / log1p(-p)).
# -log(rand()) is an exponential variate, so can use randexp().
L = -1 / log1p(-p) # L > 0
i = 0
while true
s = randexp(r) * L
s >= n - i && return S # compare before ceil to avoid overflow
push!(S, A[i += ceil(Int,s)])
end
# [This algorithm is similar in spirit to, but much simpler than,
# the one by Vitter for a related problem in "Faster methods for
# random sampling," Comm. ACM Magazine 7, 703-718 (1984).]
end
return S
end
Yes, you are right, repetitions are allowed in the rand()
example.
For distinct 10% values we could use sample()
:
using StatsBase
M = rand(50,100)
p = 0.10
n = round(Int, p*length(M))
sample(M, n; replace=false)
But the dedicated randsubseq()
function suggested by Steve has simpler syntax.
sample
(with replace=false
), randsubseq
, and rand
all appear to do different things:
rand(A, n)
returns exactly n
elements sampled from A
with replacement.sample(A, n, replace=false)
returns exactly n
elements sampled from A
without replacement.randsubseq(A, p)
returns a subset of A
where each element is included with probability p
. (Thus, the number of elements in the output is a binomial random variable with probability p
: its length can be different every time.)julia> using Random
julia> A = 1:10
1:10
julia> randsubseq(A, 0.1)
1-element Vector{Int64}:
10
julia> randsubseq(A, 0.1)
Int64[]
julia> randsubseq(A, 0.1)
1-element Vector{Int64}:
5
julia> randsubseq(A, 0.1)
2-element Vector{Int64}:
7
8
julia> randsubseq(A, 0.1)
1-element Vector{Int64}:
3
I think sample(A, n, replace=false)
is what the OP is after.
Yup, sample
is what I was looking for. Thanks!
Only one thing - how do I get columns of the matrix instead of just values from the matrix?
You could use again sample
together with view
:
view(m, :, sample(1:100,10,replace=false))
Huh, why does this result in fewer allocations than sample
alone? I expected that the call to sample
inside of view
would require allocating (defeating the purpose of the view
), but @allocated
reveals that this is not the case.
Not an explanation but an observation: the random columns selected are complete, so we may take views over them.
In my experiment, the difference was stark even for array input:
julia> using StatsBase
julia> a = rand(1000);
julia> @allocated sample(a, 50)
22087420
julia> @allocated view(a, sample(1:1000, 50; replace=false))
2263532
There might be some problem with your experiment. I get different results, confirmed also by using @btime
and interpolating the variables:
a = rand(1000);
@btime sample($a, 50) # 298 ns (1 alloc: 496 bytes)
@btime view($a, sample(1:1000, 50; replace=false)) # 880 ns (2 allocs: 8.4 KiB)