How to sample data without using external packages?

Dharik_Arsath · March 1, 2022, 1:52pm

I am trying to sample data from a population inside a loop, but that produces far too many allocations and makes my algorithm a lot slower. I need a way to sample data with no more than one allocation. How should I proceed? Is there an in-place random number generator in Julia?

lawless-m · March 1, 2022, 2:14pm

Uniform: rand!

Normal: randn!

Exponential: randexp!
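
These functions live in the Random standard library, so no external package is needed. rand! also accepts a collection to draw from, which covers drawing indices into a preallocated buffer. A minimal sketch, where `X` and `idx` are placeholder names chosen for illustration:

```julia
using Random  # standard library; provides rand!, randn!, randexp!

X = collect(10.0:10.0:1000.0)   # placeholder data, 100 elements
idx = Vector{Int}(undef, 8)     # index buffer, allocated once up front

for step in 1:3
    rand!(idx, 1:length(X))     # overwrite idx in place (indices drawn with replacement)
    batch = view(X, idx)        # view into X, the data itself is not copied
    println(step, ": ", sum(batch))
end
```

The buffer is reused on every iteration, so the loop does not build a new index vector each time.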

tbeason · March 1, 2022, 2:18pm

Note that calls to rand() do not allocate. It only allocates if you construct a vector, e.g. rand(10).

Dharik_Arsath · March 1, 2022, 2:33pm

I need to generate random numbers within a particular range, like 1:length(X).

Dharik_Arsath · March 1, 2022, 2:34pm

I am trying mini-batch gradient descent, where I need small batches of data. Is it possible to sample without constructing a vector?

goerch · March 1, 2022, 2:36pm

Something like

julia> v = Vector{Int}(undef, 3)
3-element Vector{Int64}:
 235792464
 235792496
 293017040

julia> v .= rand(1:100, 3)
3-element Vector{Int64}:
 78
 12
 50

?

giordano · March 1, 2022, 2:40pm

julia> v = Vector{Int}(undef, 3)
3-element Vector{Int64}:
 0
 0
 0

julia> v .= rand.(Ref(1:100))
3-element Vector{Int64}:
  8
 71
 39

is much better as it doesn’t allocate yet another array

Seif_Shebl · March 1, 2022, 3:16pm

Like this? Notice the 0 allocations.

function take_sample!(data, sample)
    N = length(data)
    for i = 1:100
        for j in eachindex(sample)
            sample[j] = data[rand(1:N)]
        end
    end 
end

N = 100
data = rand(-5:5, N)
sample = fill(0,5)
@btime take_sample!($data, $sample)
  3.750 μs (0 allocations: 0 bytes)

Or a faster version using rand():

function take_sample!(data, sample)
    N = length(data)
    for i = 1:100
        for j in eachindex(sample)
            sample[j] = data[trunc(Int,N*rand()+1)]
        end
    end 
end
@btime take_sample!($data, $sample)
  1.050 μs (0 allocations: 0 bytes)

You can get it even a bit faster if you use a vector for rand() since it could be SIMD’ed.
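
A rough sketch of that idea, assuming a preallocated Float64 scratch buffer (called `u` here, a name made up for this example) that is refilled with rand! and then mapped to indices. The function name `take_sample_buffered!` is also hypothetical, and this version is not benchmarked here, so any speedup should be measured rather than assumed:

```julia
using Random

function take_sample_buffered!(sample, data, u)
    N = length(data)
    rand!(u)                                      # fill u with uniform draws from [0, 1) in one call
    for j in eachindex(sample, u)
        sample[j] = data[trunc(Int, N * u[j]) + 1]  # map [0, 1) onto the indices 1:N
    end
    return sample
end

data = rand(-5:5, 100)
sample = fill(0, 5)
u = Vector{Float64}(undef, length(sample))        # scratch buffer, allocated once
take_sample_buffered!(sample, data, u)
```

As in the loops above, this draws with replacement, so the same element of data can appear more than once in sample.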

Dharik_Arsath · March 1, 2022, 5:26pm

This works well, but I am unsure why Ref is used here.

giordano · March 1, 2022, 5:34pm

Because otherwise you’d broadcast also 1:100. Instead you want to say “for each iteration call rand(1:100)”
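
A small sketch of the difference, using a short range so the effect is easy to see (`lens` is just an illustrative name):

```julia
# Without Ref, broadcasting iterates over the range itself:
# rand.(1:3) calls rand(1), rand(2), rand(3), which return vectors of length 1, 2 and 3.
lens = length.(rand.(1:3))
println(lens)                 # [1, 2, 3]

# With Ref, the range is treated as a single value, and the broadcast takes
# its shape from v: rand(1:100) is called once per element of v.
v = Vector{Int}(undef, 3)
v .= rand.(Ref(1:100))
println(all(in(1:100), v))    # true
```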

dlakelan · March 1, 2022, 5:36pm

However, this is a sample with replacement.

goerch · March 1, 2022, 5:41pm

Hm. I did a quick check and would have expected a difference:

using Random, BenchmarkTools

Random.seed!(42)

v1 = Vector{Int}(undef, 100000)
v1 .= rand(1:2, length(v1))
@show v1

Random.seed!(42)

v2 = Vector{Int}(undef, 100000)
v2 .= rand.(Ref(1:2))
@show v2
@assert v1 == v2

dlakelan · March 1, 2022, 9:05pm

nah they’re both samples with replacement.

Imagine you have 1M data points and you want to take a sample of 1000 of them… the question is will any of those 1M data points be included 2 or 3 or 4 etc times? Or is each data point included at most once?

To sample without replacement, you can just shuffle the indices and create a view:

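using Random  # standard library, needed for shuffle!

mydata = rand(1_000_000);

indices = collect(1:length(mydata))
done = false  # placeholder flag; replace with your own stopping condition
while !done
    shuffle!(indices)
    # Nested views: indices[1:100] on its own would copy the 100 indices.
    subsample = view(mydata, view(indices, 1:100))
    # ... work with the subsample
    global done = true  # placeholder so the loop terminates when run as written
end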
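
For the mini-batch use case mentioned earlier in the thread, one common pattern is to shuffle once per pass over the data and walk through the shuffled indices in consecutive chunks, so each point is used at most once per pass. A sketch under stated assumptions: the batch size, the number of passes, and the `process_batch` and `run_epochs` names are all made up for illustration, and a trailing partial batch is simply dropped:

```julia
using Random

# Hypothetical stand-in for one gradient step; here it just sums the batch.
process_batch(batch) = sum(batch)

function run_epochs(data, batchsize, nepochs)
    indices = collect(eachindex(data))      # index vector, allocated once
    total = 0.0
    for epoch in 1:nepochs
        shuffle!(indices)                   # in-place reshuffle, once per pass
        for start in 1:batchsize:(length(indices) - batchsize + 1)
            batch_idx = view(indices, start:start + batchsize - 1)
            batch = view(data, batch_idx)   # no copy of the data or the indices
            total += process_batch(batch)
        end
    end
    return total
end

mydata = rand(1_000)
run_epochs(mydata, 100, 2)
```

Shuffling once per pass instead of once per batch also avoids reshuffling the full index vector for every small sample.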
