I am trying to sample data from a population inside a loop, but that produces far too many allocations and makes my algorithm much slower. I need a way to sample data with no more than one allocation. How should I proceed? Is there an in-place random number generator in Julia?
Note that calls to rand() do not allocate. It only allocates if you construct a vector, i.e. rand(10).
I need to generate random numbers within a particular range, like 1:length(X).
I am trying mini-batch gradient descent, where I need small batches of data. Is it possible to sample without constructing a vector?
Something like this?
julia> v = Vector{Int}(undef, 3)
3-element Vector{Int64}:
 235792464
 235792496
 293017040
julia> v .= rand(1:100, 3)
3-element Vector{Int64}:
 78
 12
 50
julia> v = Vector{Int}(undef, 3)
3-element Vector{Int64}:
 0
 0
 0
julia> v .= rand.(Ref(1:100))
3-element Vector{Int64}:
 8
 71
 39
This is much better, as it doesn’t allocate yet another temporary array.
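As a side note on the "in-place" part of the question: the Random standard library provides rand!, which fills an existing array instead of returning a new one. A minimal sketch, assuming the buffer is allocated once outside the loop (the name idx and the sizes are just placeholders for illustration):

```julia
using Random

# Allocate the index buffer once, outside any loop.
idx = Vector{Int}(undef, 3)

# rand! overwrites idx with draws from 1:100 (with replacement);
# no new array is created for the result.
rand!(idx, 1:100)
```

Calling rand!(idx, 1:100) again inside a loop reuses the same buffer each time.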
Like this? Notice the 0 allocations.
using BenchmarkTools

function take_sample!(data, sample)
    N = length(data)
    for i = 1:100  # repeat the draw 100 times so the benchmark has measurable work
        for j in eachindex(sample)
            sample[j] = data[rand(1:N)]  # uniform index, with replacement
        end
    end
end

N = 100
data = rand(-5:5, N)
sample = fill(0, 5)
@btime take_sample!($data, $sample)
  3.750 μs (0 allocations: 0 bytes)
Or a faster version using rand():
function take_sample!(data, sample)
    N = length(data)
    for i = 1:100
        for j in eachindex(sample)
            # scale a Float64 from [0, 1) to an index in 1:N
            sample[j] = data[trunc(Int, N*rand() + 1)]
        end
    end
end

@btime take_sample!($data, $sample)
  1.050 μs (0 allocations: 0 bytes)
You can make it a bit faster still by generating the random numbers into a vector in one go, since that can be SIMD’ed.
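One way to read that suggestion, as a rough sketch rather than a benchmarked result: fill a preallocated Float64 buffer with rand! and then map it to indices. The function name take_sample_buffered! and the buffer buf are made up for this example, and the min clamp is my own guard against floating-point rounding at the top of the range.

```julia
using Random

# sample and buf must have the same length; buf is scratch space reused on every call.
function take_sample_buffered!(sample, data, buf)
    N = length(data)
    rand!(buf)  # fill buf in place with uniform Float64 values in [0, 1)
    for j in eachindex(sample, buf)
        # scale to 1:N; min(...) guards against rounding up to N + 1
        sample[j] = data[min(trunc(Int, N * buf[j]) + 1, N)]
    end
    return sample
end

data = rand(-5:5, 100)
sample = fill(0, 5)
buf = Vector{Float64}(undef, length(sample))
take_sample_buffered!(sample, data, buf)
```

Whether this actually beats the scalar rand() loop depends on the batch size, so it is worth checking with @btime on your own data.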
This works well, but I am unsure why Ref is used here.
Because otherwise you’d also broadcast over 1:100. Instead you want to say “for each iteration, call rand(1:100)”.
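A small illustration of the difference, using a short range so the effect is easy to see (the variable names are only for this example):

```julia
using Random

v = Vector{Int}(undef, 3)

# Ref makes the range behave like a scalar under broadcasting,
# so rand(1:100) is called once per element of v.
v .= rand.(Ref(1:100))

# Without Ref, the dot call broadcasts over the elements of the range itself:
# this evaluates rand(1), rand(2), rand(3), each returning a Float64 vector.
w = rand.(1:3)
```

Here w ends up as a vector of three Float64 vectors with lengths 1, 2 and 3, which is not what you want for index sampling.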
However, this is a sample with replacement.
Hm. I did a quick check; I would have expected a difference:
using Random, BenchmarkTools
Random.seed!(42)
v1 = Vector{Int}(undef, 100000)
v1 .= rand(1:2, length(v1))
@show v1
Random.seed!(42)
v2 = Vector{Int}(undef, 100000)
v2 .= rand.(Ref(1:2))
@show v2
@assert v1 == v2
Nah, they’re both samples with replacement.
Imagine you have 1M data points and you want to take a sample of 1000 of them. The question is: can any of those 1M data points be included 2, 3, 4 or more times, or is each data point included at most once?
To sample without replacement, you can just shuffle the indices and create a view:
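using Random

mydata = rand(1_000_000)
indices = collect(1:1_000_000)
done = false
while !done
    shuffle!(indices)
    # nested views avoid copying both the index slice and the data
    subsample = @view(mydata[view(indices, 1:100)])
    # ... work with the subsample
    global done = true  # placeholder stop condition so the loop terminates
end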
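If shuffling all the indices on every iteration turns out to be too expensive for a small batch, a partial Fisher-Yates shuffle is one possible alternative: it only randomizes the first k positions. This is a sketch of that swapped-in technique, not something from the posts above, and partial_shuffle! is a hypothetical helper name.

```julia
using Random

# After the call, indices[1:k] holds k distinct entries drawn uniformly
# without replacement; indices as a whole stays a permutation, so it can be reused.
function partial_shuffle!(indices::AbstractVector, k::Integer)
    n = length(indices)
    @assert 0 <= k <= n
    for i in 1:k
        j = rand(i:n)  # pick a position from the not-yet-fixed tail
        indices[i], indices[j] = indices[j], indices[i]
    end
    return indices
end

mydata = rand(1_000)
indices = collect(eachindex(mydata))
partial_shuffle!(indices, 100)
batch = @view mydata[view(indices, 1:100)]  # no copy of the data or the indices
```

Each call does k swaps instead of touching every index, and the batch is still a view into the original data.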