# Efficient repeated sampling of small vectors

Hi, my problem is simple: I am currently playing around with multi-armed bandits and, long story short, in every iteration step I need to sample 10 i.i.d. random variables (one for each arm).

I noticed that repeatedly sampling this small vector takes significantly longer than filling one longer array:

```julia
using BenchmarkTools, Random
a = zeros(10000)
@btime randn!(a) # 14.500 μs
b = zeros(10)
foo(a) = for _ in 1:1000 randn!(b) end
@btime foo(a) # 53.125 μs
```

My (uneducated) guess is that this comes from SIMD/AVX optimizations. I would therefore like some kind of workaround/sampler/generator that lets me use `rand!` just as before but internally samples larger vectors. My first naive attempt looks like this:

```julia
using Distributions
import Random: rand!

struct RandomGenerator{T<:AbstractVector}
    state
    count
    sampler
    preallocation::T
    RandomGenerator(sampler, count=256) = new{Vector{Float64}}(
        Ref(1), count, sampler, rand(sampler, count)
    )
end

function rand!(x::RandomGenerator, dest::T) where {T<:AbstractVector}
    i = 1
    len = length(dest)
    while i <= len
        # copy as many buffered values as possible
        n = min(x.count - x.state[] + 1, len - i + 1)
        dest[i:i+n-1] = x.preallocation[x.state[]:x.state[]+n-1]
        i += n
        x.state[] += n
        if x.state[] > x.count
            # buffer exhausted: refill it
            x.state[] = 1
            rand!(x.sampler, x.preallocation)
        end
    end
    return dest
end

gen = RandomGenerator(Normal())
results = zeros(10)
rand!(gen, results)
```

I am just wondering whether this can be done in a conceptually smarter or more idiomatic way (e.g. with the iteration interface).

Here is an approach:

```julia
using BenchmarkTools
using Random

struct RandomGenerator{T}
    # Type-parameterize your struct fields for better type inference
    values::T
    function RandomGenerator(type, preallocation_size)
        seed = rand(type, preallocation_size)
        # Using tools from Iterators to handle the state stuff.
        # You might not need Iterators.cycle if you know exactly how many
        # values you'll need, but cycle will let it loop back around if you
        # ask for too many.
        iterator = Iterators.Stateful(Iterators.cycle(seed))
        new{typeof(iterator)}(iterator)
    end
end

function Random.rand!(rg::RandomGenerator, dest)
    vals = Iterators.take(rg.values, length(dest))

    # Generally best practice to use eachindex, not 1:length(x), since
    # not all arrays start at 1
    for (i, val) in zip(eachindex(dest), vals)
        @inbounds dest[i] = val
    end
    return dest
end

# A couple of functions to test performance
function standard_rand(dest_length, dest_n)
    # Only allocating dest once, since the allocations are the slowest part
    dest = zeros(dest_length)
    for _ in 1:dest_n
        rand!(dest)
    end
    return dest
end

function preall_rand(dest_length, dest_n)
    rg = RandomGenerator(Float64, dest_length * dest_n)
    x = zeros(dest_length)
    for _ in 1:dest_n
        rand!(rg, x)
    end
    return x
end

@btime standard_rand(10, 1000)
# 44.600 μs (1 allocation: 144 bytes)
@btime preall_rand(10, 1000)
# 16.100 μs (7 allocations: 78.41 KiB)
```

Another question to ask yourself: do you really need a mutable, materialized array of random numbers? If not, you could return a view of the values in the `RandomGenerator` instead of copying them into `dest`.

Edit: I just realized that you want to guarantee each subsample has a normal distribution. Heading to bed now, but I can think about this again tomorrow unless someone beats me to it!


This benchmark has issues with global variables; you avoid them with `$`-interpolation in the macro expression. Also, the function appears to not use its argument and instead uses the type-unstable global `b`, which I assume was a mistake. Is this what you intended?

```julia
a = zeros(10000)
@btime randn!($a) #
b = zeros(10)
foo(x) = for _ in 1:1000 randn!(x) end
@btime foo($b) #
```

Doesn’t seem to change the performance discrepancy much.


Maybe StaticArrays.jl can be useful?
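For what it's worth, a sketch of that idea (my own example, not from this thread; it assumes the number of arms is fixed at compile time): `@SVector randn(10)` returns a fresh stack-allocated vector on every call, so no preallocated buffer is needed at all.

```julia
using StaticArrays

# Hypothetical sketch: 10 i.i.d. standard-normal draws as a
# stack-allocated SVector -- no heap allocation per draw.
sample_arms() = @SVector randn(10)

function play(n_rounds)
    total = 0.0
    for _ in 1:n_rounds
        draws = sample_arms()    # fresh draws each iteration
        total += maximum(draws)  # e.g. look at the best-scoring arm
    end
    return total
end
```

The trade-off is that the length must be a compile-time constant, which fits the fixed-10-arms setting here.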


Yes, you are right of course (that was 4am-style stupidity). But as you said, adding `$` for the benchmark and a `const` keyword for the arrays doesn’t change too much.

I will look into your approach more deeply later; regarding your last question: Yes, a view would do it, I don’t need a materialized array.

That’s a nice approach. One issue I have is that `Iterators.cycle(seed)` starts over with the same samples, so we only ever draw `preallocation_size` distinct values. I, however, want to sample infinitely many different values (abstracting from the fact that we only have pseudorandomness).

Not Julia specific, but why do you need to sample each arm on every iteration? In the standard MAB, only a single arm’s random outcome is observed, so I would think you can sample just that one.

That’s a valid point. Firstly, sampling only one value per iteration is even less performant, and it is just a special case of our example anyway. Secondly, some learning strategies sample one value per arm per iteration (e.g. Boltzmann–Gumbel exploration).

In other words: our problem arises naturally.

After playing around a bit, I have modified your approach into a version with infinite samples:

```julia
mutable struct RandomGenerator{S, T}
    seed::S
    values::T
    function RandomGenerator(type, preallocation_size)
        seed = rand(type, preallocation_size)
        iterator = Iterators.Stateful(seed)
        new{typeof(seed), typeof(iterator)}(seed, iterator)
    end
end

function Random.rand!(rg::RandomGenerator, dest)
    vals = Iterators.take(rg.values, length(dest))
    if length(vals) != length(dest)
        # buffer exhausted: resample the seed and start over
        rand!(rg.seed)
        rg.values = Iterators.Stateful(rg.seed)
        Random.rand!(rg, dest)
    else
        for (i, val) in zip(eachindex(dest), vals)
            @inbounds dest[i] = val
        end
        return dest
    end
end
```

The performance is slightly worse (`18 μs` vs `16 μs`).

If (as you mentioned earlier) I just want to use these values in some calculation, I can use

```julia
function Random.rand(rg::RandomGenerator, dest_length::Integer)
    vals = Iterators.take(rg.values, dest_length)
    if dest_length != length(vals)
        # buffer exhausted: resample the seed and start over
        rand!(rg.seed)
        rg.values = Iterators.Stateful(rg.seed)
        Random.rand(rg, dest_length)
    else
        return vals
    end
end
```

and then iterate over our return value. For example:

```julia
function preall_rand(dest_length, dest_n)
    rg = RandomGenerator(Float64, dest_length * dest_n)
    x = zeros(dest_length)
    for _ in 1:dest_n
        vals = rand(rg, dest_length)
        for (i, val) in zip(1:dest_length, vals)
            @inbounds x[i] = val + 1.0 # e.g. an arbitrary calculation, +1
        end
    end
    return x
end
```

However, I noticed that if I use broadcasting via `@inbounds x .= vals .+ 1.0`, it gets much slower. Why is that? Can I avoid that? Should I somehow use `view`?

Edit: I see now why broadcasting is slow: general iterables get collected into an array before broadcasting, since they carry no shape information. That is unfortunate, since I would like to use broadcasting in my user interface.
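A minimal self-contained sketch of the contrast (a toy example of mine, separate from the generator above): a lazy iterator has no axes, so broadcasting must materialize it first, whereas a `view` into the buffer already has a shape and can be broadcast in place:

```julia
buffer = rand(100)
x = zeros(10)

# A lazy iterator carries no shape information; broadcasting over it
# would first materialize it with collect (an extra allocation).
it = Iterators.take(Iterators.Stateful(buffer), 10)

# A view into the buffer has axes, so broadcast can write in place.
v = view(buffer, 1:10)
x .= v .+ 1.0
```

This is why returning a view (as in the solution below in the thread) plays well with a broadcasting-based user interface.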


I think I will stick with this solution using `view` for now:

```julia
mutable struct RandomGenerator{S, T}
    count::S
    values::T
    RandomGenerator(type, preallocation_size) = begin
        values = rand(type, preallocation_size)
        new{type, typeof(values)}(1, values)
    end
end

function Random.rand(rg::RandomGenerator, len::Integer)
    lower = rg.count
    upper = lower + len - 1
    if upper > length(rg.values)
        @assert length(rg.values) >= len "your requested length is too long"
        # buffer exhausted: refill it and start from the beginning
        Random.rand!(rg.values)
        rg.count = 1
        Random.rand(rg, len)
    else
        vals = view(rg.values, lower:upper)
        rg.count = upper + 1
        return vals
    end
end
```

Thanks for the numerous suggestions and tips!

You have a typo. You input `a` but modify the global variable `b`.


Thanks @DNF, I will just leave it as is to keep this thread readable.

The line `new{type, typeof(values)}(1, values)` in the `RandomGenerator` constructor should be amended, since `1` is not necessarily of type `S` (here, `type`).

1 Like