How to sample data without using external packages?

I am trying to sample data from a population inside a loop, but that produces far too many allocations and makes my algorithm a lot slower. I need a way to sample data with no more than one allocation. How should I proceed? Is there an in-place random number generator in Julia?

Uniform: rand!

Normal: randn!

Exponential: randexp!

Note that calls to rand() do not allocate. It only allocates if you construct a vector, i.e. rand(10).
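A minimal sketch of how the in-place functions listed above might be used inside a loop, assuming only the `Random` standard library (it ships with Julia, so it is not an external package). The buffer names `buf` and `idx` are illustrative. `rand!` also accepts a collection to draw from, so it can fill an integer buffer from a range.

```julia
using Random

buf = Vector{Float64}(undef, 5)   # allocated once, outside the loop
idx = Vector{Int}(undef, 5)       # allocated once, outside the loop

for _ in 1:3
    rand!(buf)          # uniform numbers in [0, 1), written into buf
    randn!(buf)         # standard normal numbers, overwriting buf
    randexp!(buf)       # exponential numbers (rate 1), overwriting buf
    rand!(idx, 1:100)   # integers drawn from 1:100, written into idx
end
```

The buffers are created once and overwritten on every iteration, so the loop body itself should not need to build new arrays.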

I need to generate random numbers within a particular range, like 1:length(X).

I am trying mini-batch gradient descent, where I need small batches of data. Is it possible to sample without constructing a vector?

Something like

julia> v = Vector{Int}(undef, 3)
3-element Vector{Int64}:
 235792464
 235792496
 293017040

julia> v .= rand(1:100, 3)
3-element Vector{Int64}:
 78
 12
 50

?

julia> v = Vector{Int}(undef, 3)
3-element Vector{Int64}:
 0
 0
 0

julia> v .= rand.(Ref(1:100))
3-element Vector{Int64}:
  8
 71
 39

is much better, as it doesn’t allocate yet another temporary array.


Like this? Notice the 0 allocations.

function take_sample!(data, sample)
    N = length(data)
    for i = 1:100
        for j in eachindex(sample)
            sample[j] = data[rand(1:N)]
        end
    end 
end

using BenchmarkTools  # provides @btime

N = 100
data = rand(-5:5, N)
sample = fill(0,5)
@btime take_sample!($data, $sample)
  3.750 μs (0 allocations: 0 bytes)

Or a faster version using rand():

function take_sample!(data, sample)
    N = length(data)
    for i = 1:100
        for j in eachindex(sample)
            sample[j] = data[trunc(Int,N*rand()+1)]
        end
    end 
end
@btime take_sample!($data, $sample)
  1.050 μs (0 allocations: 0 bytes)

You can get it even a bit faster if you fill a preallocated vector with rand!, since generating many numbers at once can be SIMD’ed.
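A rough sketch of that idea, under the assumption that a scratch buffer of uniform numbers is passed in alongside the sample. The function name `take_sample_buffered!` and the buffer `u` are hypothetical, not from the posts above, and whether it is actually faster would need to be benchmarked.

```julia
using Random

# Hypothetical variant of take_sample! that uses a preallocated scratch buffer `u`.
function take_sample_buffered!(sample, data, u)
    N = length(data)
    rand!(u)                       # fill the whole buffer with uniforms in [0, 1) in one call
    for j in eachindex(sample, u)  # sample and u must have the same indices
        # map u[j] in [0, 1) to an integer index in 1:N
        sample[j] = data[trunc(Int, N * u[j]) + 1]
    end
    return sample
end

data = rand(-5:5, 100)
sample = zeros(Int, 5)
u = Vector{Float64}(undef, length(sample))  # allocated once, reused on every call
take_sample_buffered!(sample, data, u)
```

Like the versions above, this samples with replacement.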


This works well, but I am unsure why Ref is used here.

Because otherwise you’d also broadcast over 1:100. Instead, you want to say “for each element of v, call rand(1:100)”.
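A small illustration of the difference, using `length` as a stand-in function so the effect of `Ref` is visible without randomness. The variable names are illustrative only.

```julia
r = 1:3

# Without Ref, broadcasting iterates over the elements of r:
# length(1), length(2), length(3)
a = length.(r)

# With Ref, r is treated as a single value: length(1:3)
b = length.(Ref(r))

# In an in-place broadcast, rand(1:100) is called once per element of v
v = zeros(Int, 3)
v .= rand.(Ref(1:100))
```

Here `a` is a 3-element vector of ones, while `b` is the single number 3.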


However, this is a sample with replacement.


Hm. I did a quick check and would have expected a difference:

using Random, BenchmarkTools

Random.seed!(42)

v1 = Vector{Int}(undef, 100000)
v1 .= rand(1:2, length(v1))
@show v1

Random.seed!(42)

v2 = Vector{Int}(undef, 100000)
v2 .= rand.(Ref(1:2))
@show v2
@assert v1 == v2

Nah, they’re both samples with replacement.

Imagine you have 1M data points and you want to take a sample of 1000 of them. The question is: can any of those 1M data points be included 2, 3, 4, or more times? Or is each data point included at most once?

To sample without replacement, you can just shuffle the indices and create a view:

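using Random  # provides shuffle!

mydata = rand(1_000_000);
indices = collect(1:length(mydata))  # allocated once, reused on every iteration

for iteration in 1:10  # placeholder count; the original loops until some `done` condition holds
    shuffle!(indices)  # in-place random permutation of the indices
    # View the first 100 shuffled indices; indices[1:100] alone would copy them
    subsample = @view mydata[@view indices[1:100]]
    # ... work with the subsample
end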
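Shuffling all of the indices touches every element even when only a small batch is needed. One possible alternative, sketched here as an assumption rather than something from the posts above, is a partial Fisher-Yates shuffle that only randomizes the first k positions. The helper name `partial_shuffle!` is hypothetical.

```julia
using Random

# Hypothetical helper: after the call, idx[1:k] holds k distinct entries of idx,
# chosen uniformly at random. Only the first k positions are shuffled.
function partial_shuffle!(rng::AbstractRNG, idx::AbstractVector{Int}, k::Integer)
    n = length(idx)
    @assert 0 <= k <= n
    for i in 1:k
        j = rand(rng, i:n)                 # pick a position from the not-yet-fixed tail
        idx[i], idx[j] = idx[j], idx[i]    # swap it into position i
    end
    return idx
end

rng = Random.default_rng()
mydata = rand(rng, 1_000)
indices = collect(1:length(mydata))        # allocated once, reused for every batch

partial_shuffle!(rng, indices, 100)
batch = @view mydata[@view indices[1:100]] # 100 distinct data points, no copy
```

Because the swaps only permute `indices`, the vector stays a valid permutation and can be reused for the next batch without resetting it.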