# How to sample data without using external packages?

I am trying to sample data from a population inside a loop, but that produces far too many allocations and makes my algorithm a lot slower. I need a way to sample data with no more than one allocation. How should I proceed? Is there an in-place random number generator in Julia?

Uniform: `rand!`

Normal: `randn!`

Exponential: `randexp!`
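All three live in the `Random` standard library, so no external package is needed. A minimal sketch, assuming a preallocated `Float64` buffer (the name `buf` is only for illustration):

```julia
using Random  # standard library, ships with Julia

buf = Vector{Float64}(undef, 5)  # allocate once, outside the hot loop

rand!(buf)     # overwrite buf with uniform samples from [0, 1)
randn!(buf)    # overwrite buf with standard normal samples
randexp!(buf)  # overwrite buf with exponential samples (rate 1)
```

Each call reuses the same memory, so the buffer can be refilled in every loop iteration.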

Note that calls to `rand()` do not allocate. It will only allocate if you construct a vector, for example `rand(10)`.
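One way to check this yourself is `@allocated`. The sketch below wraps both calls in small functions (the names `scalar_draw` and `vector_draw` are made up here) and calls them once beforehand so that compilation is not counted:

```julia
scalar_draw() = rand()      # returns a single Float64, no array involved
vector_draw() = rand(10)    # builds a new 10-element Vector{Float64}

# Call once first so compilation is not counted in the measurement.
scalar_draw(); vector_draw()

scalar_bytes = @allocated scalar_draw()
vector_bytes = @allocated vector_draw()
println("rand():   ", scalar_bytes, " bytes")
println("rand(10): ", vector_bytes, " bytes")
```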

I need to generate random numbers within a particular range, like `1:length(X)`.

I am trying mini-batch gradient descent, where I need small batches of data. Is it possible to sample without constructing a vector?

Something like

``````
julia> v = Vector{Int}(undef, 3)
3-element Vector{Int64}:
235792464
235792496
293017040

julia> v .= rand(1:100, 3)
3-element Vector{Int64}:
78
12
50
``````

?

``````
julia> v = Vector{Int}(undef, 3)
3-element Vector{Int64}:
0
0
0

julia> v .= rand.(Ref(1:100))
3-element Vector{Int64}:
8
71
39
``````

is much better as it doesn’t allocate yet another array
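For the range case asked about above, `rand!` from the `Random` standard library also accepts the collection to draw from as its second argument. A minimal sketch, assuming stand-in data `X` and two preallocated buffers (`idx` and `batch` are illustrative names):

```julia
using Random

X = randn(100)                      # stand-in for the real data
idx = Vector{Int}(undef, 8)         # index buffer, allocated once
batch = Vector{Float64}(undef, 8)   # mini-batch buffer, allocated once

rand!(idx, 1:length(X))   # fill idx with random indices into X (with replacement)
rand!(batch, X)           # or draw the elements themselves straight into batch
```

Both calls overwrite their first argument, so they can be repeated inside the training loop without creating new arrays.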


Like this? Notice the 0-allocations.

``````
using BenchmarkTools

function take_sample!(data, sample)
    N = length(data)
    for i = 1:100                         # repeat the draw 100 times per call
        for j in eachindex(sample)
            sample[j] = data[rand(1:N)]   # uniform random index, with replacement
        end
    end
end

N = 100
data = rand(-5:5, N)
sample = fill(0, 5)
@btime take_sample!($data, $sample)
# 3.750 μs (0 allocations: 0 bytes)
``````

Or a faster version using rand():

``````
function take_sample!(data, sample)
    N = length(data)
    for i = 1:100
        for j in eachindex(sample)
            # rand() is in [0, 1), so N*rand()+1 is in [1, N+1) and truncates to 1:N
            sample[j] = data[trunc(Int, N*rand() + 1)]
        end
    end
end
@btime take_sample!($data, $sample)
# 1.050 μs (0 allocations: 0 bytes)
``````

You can get it even a bit faster if you use a vector for rand() since it could be SIMD’ed.
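One way to read that suggestion, as a sketch rather than a measured result: fill a reusable scratch vector with a single `rand!` call, then map the uniforms to indices. The function name `take_sample_buffered!` and the scratch vector `u` are hypothetical, and no timing is claimed for this version.

```julia
using Random

# Hypothetical variant: draw all uniforms at once into a reusable scratch buffer `u`.
function take_sample_buffered!(data, sample, u)
    N = length(data)
    rand!(u)                       # one bulk call fills u with values in [0, 1)
    for j in eachindex(sample, u)  # sample and u must have matching indices
        sample[j] = data[trunc(Int, N * u[j]) + 1]  # map [0, 1) to 1:N
    end
    return sample
end

data = rand(-5:5, 100)
sample = fill(0, 5)
u = Vector{Float64}(undef, length(sample))  # allocated once, reused on every call
take_sample_buffered!(data, sample, u)
```

The extra buffer costs one allocation up front, which stays within the "no more than one allocation" budget from the question if `sample` is also reused.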


This works well, but I am unsure why `Ref` is used here.

Because otherwise you’d also broadcast over `1:100`. Instead you want to say "for each element, call `rand(1:100)`".
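To make that concrete: without `Ref`, the dot in `rand.(1:100)` iterates over the elements of the range and calls `rand(1)`, `rand(2)`, and so on. Wrapping the range in `Ref` makes broadcasting treat it as a single value. A small sketch (the vectors `v` and `w` are just for illustration):

```julia
v = zeros(Int, 1000)

# Ref(1:100) makes the range behave like a scalar under broadcasting,
# so rand(1:100) is called once per element of v.
v .= rand.(Ref(1:100))

# For comparison, fill! evaluates rand(1:100) once and copies that one value.
w = zeros(Int, 1000)
fill!(w, rand(1:100))
```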


However, this is sampling with replacement.


Hm. I did a quick check and would have expected a difference:

``````
using Random, BenchmarkTools

Random.seed!(42)

v1 = Vector{Int}(undef, 100000)
v1 .= rand(1:2, length(v1))
@show v1

Random.seed!(42)

v2 = Vector{Int}(undef, 100000)
v2 .= rand.(Ref(1:2))
@show v2
@assert v1 == v2
``````

No, they’re both samples with replacement.

Imagine you have 1M data points and you want to take a sample of 1000 of them. The question is whether any of those 1M data points can be included 2, 3, 4 or more times, or whether each data point is included at most once.

To sample without replacement, you can just shuffle the indices and create a view:

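``````
using Random

mydata = rand(1_000_000);

indices = collect(1:1_000_000)
for iteration in 1:10   # placeholder for the original `while !done` condition
    shuffle!(indices)   # in-place permutation, so the first 100 entries are distinct
    # indices[1:100] copies 100 indices; the view avoids copying the data itself
    subsample = @view(mydata[indices[1:100]])
    # ... work with the subsample
end
``````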
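If shuffling all 1M indices on every iteration turns out to be too slow, a partial Fisher-Yates shuffle only touches the first `k` positions. This is a swapped-in technique, not what the post above does; the helper name `partial_shuffle!` is hypothetical, and the sketch assumes `k <= length(indices)`.

```julia
# Hypothetical helper: after the call, indices[1:k] holds k distinct entries
# chosen uniformly at random (a partial Fisher-Yates shuffle).
function partial_shuffle!(indices, k)
    n = length(indices)
    for i in 1:k
        j = rand(i:n)   # pick from the not-yet-fixed tail, including position i
        indices[i], indices[j] = indices[j], indices[i]
    end
    return indices
end

mydata = rand(1_000_000)
indices = collect(1:length(mydata))   # allocated once

partial_shuffle!(indices, 100)
subsample = view(mydata, view(indices, 1:100))  # no copy of data or indices
```

Because `indices` stays a permutation after each call, the helper can be called again in the next iteration without resetting anything.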