Efficient repeated sampling of small vector

Hi, my problem is simple: I am currently playing around with multi-armed bandits and, long story short, in every iteration step I need to sample 10 i.i.d. random variables (one for each arm).

I noticed that it takes significantly longer to sample into this small vector repeatedly than to fill one longer array:

using BenchmarkTools, Random
a = zeros(10000)
@btime randn!(a) # 14.500 μs
b = zeros(10)
foo(a) = for _ in 1:1000 randn!(b) end
@btime foo(a) # 53.125 μs

My (uneducated) guess is that this results from SIMD/AVX optimizations. Therefore I would like some kind of workaround/sampler/generator which allows me to use rand! just as before but internally samples bigger vector sizes. My first naive attempt looks like this:

using Distributions, Random

struct RandomGenerator{T<:AbstractVector}
    state
    count
    sampler
    preallocation::T
    RandomGenerator(sampler, count=256) = new{Vector{Float64}}(
        Ref(1), count, sampler, rand(sampler, count)
    )
end

function Random.rand!(x::RandomGenerator, dest::T) where {T<:AbstractVector}
    i = 1
    len = length(dest)
    while i <= len
        # copy as many values as the buffer still holds (or as dest needs)
        n = min(x.count - x.state[] + 1, len - i + 1)
        dest[i:i+n-1] = x.preallocation[x.state[]:x.state[]+n-1]
        i += n
        x.state[] += n
        if x.state[] > x.count
            x.state[] = 1
            rand!(x.sampler, x.preallocation)
        end
    end
    return dest
end

gen = RandomGenerator(Normal())
results = zeros(10)
rand!(gen, results)

I am just wondering whether this can be done conceptually smarter or more idiomatically (e.g. with an iterator interface).

Here is an approach:

using BenchmarkTools
using Random

struct RandomGenerator{T}
    # Type parameterize your struct fields for better type inference
    values::T
    function RandomGenerator(type, preallocation_size)
        seed = rand(type, preallocation_size)
        # Using tools from Iterators to handle the state stuff
        # You might not need Iterators.cycle if you know exactly how many values you'll need
        # but cycle will let it loop back around if you ask for too many
        iterator = Iterators.Stateful(Iterators.cycle(seed))
        new{typeof(iterator)}(iterator)
    end
end

function Random.rand!(rg::RandomGenerator, dest)
    vals = Iterators.take(rg.values, length(dest))
    
    # Generally best practice to use eachindex not 1:length(x) since not
    # all arrays start at 1
    for (i, val) in zip(eachindex(dest), vals)
        @inbounds dest[i] = val
    end
    return dest
end

# A couple of functions to test performance
function standard_rand(dest_length, dest_n)
    # I'm only allocating dest once since the allocations are going to be the slowest part
    dest = zeros(dest_length)
    for _ in 1:dest_n
        rand!(dest)
    end
    return dest
end

function preall_rand(dest_length, dest_n)
    rg = RandomGenerator(Float64, dest_length * dest_n)
    x = zeros(dest_length)
    for _ in 1:dest_n
        rand!(rg, x)
    end
    return x
end

@btime standard_rand(10, 1000)
# 44.600 μs (1 allocation: 144 bytes)
@btime preall_rand(10, 1000)
# 16.100 μs (7 allocations: 78.41 KiB)

Another question to ask yourself is, do you really need a mutable, materialized array of random numbers? If not, you could return a view of the values in RandomGenerator instead of copying the values into dest.
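As a rough standalone sketch of that idea (the `ViewSource` name and the omitted refill logic are my own assumptions, not something from this thread):

```julia
# Hypothetical sketch: keep one preallocated buffer plus a cursor and
# hand out non-copying views into it. Refilling the buffer when it is
# exhausted is omitted for brevity.
mutable struct ViewSource
    buf::Vector{Float64}
    pos::Int
end
ViewSource(n) = ViewSource(rand(n), 1)

function next_view(src::ViewSource, n)
    v = view(src.buf, src.pos:src.pos + n - 1)
    src.pos += n
    return v   # a SubArray; no copy is made
end
```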

Edit: I just realized that you want to guarantee each subsample follows a normal distribution. Headed to bed now, but I can think about this again tomorrow unless someone beats me to it!


This benchmark has issues with global variables; you avoid them with $-interpolation in the macro expression. Also, the function does not use its argument and instead uses the type-unstable global b, which I assume was a mistake. Is this what you intended?

a = zeros(10000)
@btime randn!($a) #
b = zeros(10)
foo(x) = for _ in 1:1000 randn!(x) end
@btime foo($b) #

Doesn’t seem to change the performance discrepancy much.


Maybe StaticArrays.jl can be useful?
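For instance, a minimal sketch of what that could look like here: each iteration draws a stack-allocated SVector of 10 standard normals, so the per-call heap allocation disappears entirely (whether this beats a buffered generator would still need benchmarking):

```julia
using StaticArrays, Random

# Sketch: draw a stack-allocated SVector of 10 standard normals per
# iteration instead of filling a heap-allocated Vector in place.
function foo_static(n)
    s = 0.0
    for _ in 1:n
        v = @SVector randn(10)   # stack-allocated, no heap allocation
        s += sum(v)
    end
    return s
end
```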


Yes, you are right of course (that was a 4am kinda stupidity :grin:). But as you said, adding $ for the benchmark and a const keyword for the arrays doesn’t change too much.

I will look into your approach more deeply later; regarding your last question: Yes, a view would do it, I don’t need a materialized array.

That’s a nice approach. One issue I have is that Iterators.cycle(seed) starts over with the same sample, so we only ever draw preallocation_size distinct values. I, however, want to sample infinitely many different values (setting aside the fact that we only have pseudorandomness anyway).
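To illustrate the repetition:

```julia
# Iterators.cycle repeats the same underlying buffer, so a Stateful
# iterator over it keeps replaying the original draws instead of
# producing fresh random values.
seed = [0.1, 0.2]
it = Iterators.Stateful(Iterators.cycle(seed))
first_five = [popfirst!(it) for _ in 1:5]  # [0.1, 0.2, 0.1, 0.2, 0.1]
```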

Not Julia specific, but why do you need to sample each arm on every iteration? In the standard MAB, only a single arm’s random outcome is observed, so I would think you can sample just that one.

That’s a valid point. Firstly, if I sample only one value per iteration it is even less performant, and it is just a special case of our example. Secondly, some learning strategies sample one value per arm per iteration (e.g. Boltzmann-Gumbel exploration).

In other words: Our problem arises naturally :smiley:

After playing a bit around, I have modified your approach to a version with infinite samples:

mutable struct RandomGenerator{S, T}
    seed::S
    values::T
    function RandomGenerator(type, preallocation_size)
        seed = rand(type, preallocation_size)
        iterator = Iterators.Stateful(seed)
        new{typeof(seed), typeof(iterator)}(seed, iterator)
    end
end

function Random.rand!(rg::RandomGenerator, dest)
    vals = Iterators.take(rg.values, length(dest))
    if length(vals) != length(dest)
        rand!(rg.seed)
        rg.values = Iterators.Stateful(rg.seed)
        Random.rand!(rg, dest)
    else
        for (i, val) in zip(eachindex(dest), vals)
            @inbounds dest[i] = val
        end
        return dest
    end
end

The performance is slightly worse (about 18 μs vs 16 μs).

If (as you mentioned earlier) I just want to use these values in some calculation, I can use

function Random.rand(rg::RandomGenerator, dest_length::Integer)
    vals = Iterators.take(rg.values, dest_length)
    if length(vals) != dest_length
        rand!(rg.seed)
        rg.values = Iterators.Stateful(rg.seed)
        Random.rand(rg, dest_length)
    else
        return vals
    end
end

and then iterate over our return value. For example:

function preall_rand(dest_length, dest_n)
    rg = RandomGenerator(Float64, dest_length * dest_n)
    x = zeros(dest_length)
    for _ in 1:dest_n
        vals = rand(rg, dest_length)
        for (i, val) in zip(1:dest_length, vals)
            @inbounds x[i] = val + 1.0 # e.g. calculation with arbitrary +1
        end
    end
    return x
end

However, I noticed that if I use broadcasting via @inbounds x .= vals .+ 1.0, it gets much slower. Why is that? Can I avoid that? Should I somehow use view?

Edit: I see now why broadcasting is slow: iterables get collected. That is unfortunate, since I would like to use broadcasting in my user interface.
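A quick illustration of why: broadcasting calls Base.broadcastable, whose generic fallback collects an iterator into an Array, while an AbstractArray such as a view is used as-is:

```julia
# The generic fallback Base.broadcastable(x) = collect(x) materializes
# lazy iterators before broadcasting; AbstractArrays pass through.
it = Iterators.take([1.0, 2.0, 3.0], 2)
Base.broadcastable(it) isa Array     # true: the lazy iterator gets collected
v = view([1.0, 2.0, 3.0], 1:2)
Base.broadcastable(v) === v          # true: the view broadcasts without a copy
```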


I think I will stick with this solution using view for now:

mutable struct RandomGenerator{S, T}
    count::S
    values::T
    RandomGenerator(type, preallocation_size) = begin
        values = rand(type, preallocation_size)
        new{type, typeof(values)}(1, values)
    end
end

function Random.rand(rg::RandomGenerator, len::Integer)
    lower = rg.count
    upper = lower + len - 1
    if upper > length(rg.values)
        @assert (length(rg.values) >= len) "your requested length is too long"
        Random.rand!(rg.values)
        rg.count = 1
        Random.rand(rg, len)
    else
        vals = view(rg.values, lower:upper)
        rg.count = upper + 1
        return vals
    end
end

Thanks for the numerous suggestions and tips!

You have a typo. You input a but modify the global variable b.


Thanks @DNF, I will just leave it as is to keep this thread readable.

This line in the RandomGenerator constructor should be amended, since 1 is not necessarily convertible to the type parameter S (which is set to type here); with a Float64 count, lower:upper is not a valid index range for view.
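One possible amendment (a sketch; it pins the cursor to Int so that lower:upper is always a valid integer index range):

```julia
# Sketch of the amended struct: the buffer cursor stays an Int
# regardless of the element type being sampled.
mutable struct RandomGenerator{T}
    count::Int
    values::T
    RandomGenerator(type, preallocation_size) =
        new{Vector{type}}(1, rand(type, preallocation_size))
end
```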
