Performance of generating uniform Float32 numbers

I’m trying to generate Float32s uniformly from a range, and found that Uniform from Distributions.jl doesn’t offer great performance compared to just scaling and shifting what rand(Float32) already produces. Am I doing something wrong?

using Distributions   # for Uniform

mutable struct RangeFloats
    rangebegin::Float32    # left endpoint of the range
    rangelength::Float32   # width of the range
end

mutable struct RangeDistribution
    d::Uniform{Float32}
end

@inline function generate_uniform(rf::RangeFloats)
   return rand(Float32)*rf.rangelength + rf.rangebegin
end

function sampleuniform(rf::RangeFloats)
    cum = 0f0
    for _ in 1:10000
        cum += generate_uniform(rf)
    end
    return cum
end

function sampleuniform(rd::RangeDistribution)
    cum = 0f0
    for _ in 1:10000
        cum += Float32(rand(rd.d))
    end
    return cum
end

Here I find sampleuniform(::RangeFloats) has better performance.
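To make the comparison reproducible without depending on Distributions.jl, here is a minimal self-contained sketch: f64_sample mimics what rand(::Uniform) does internally (draw a Float64, scale, shift), while f32_sample stays in Float32 throughout. All names here are mine, not from Distributions.jl.

```julia
# Stays in Float32 end to end
f32_sample(a::Float32, len::Float32) = rand(Float32) * len + a

# Stand-in for rand(::Uniform): draws a Float64 and scales/shifts it
f64_sample(a::Float64, b::Float64) = rand() * (b - a) + a

function accumulate32(n)
    cum = 0f0
    for _ in 1:n
        cum += f32_sample(2f0, 3f0)
    end
    return cum
end

function accumulate64(n)
    cum = 0f0
    for _ in 1:n
        cum += Float32(f64_sample(2.0, 5.0))   # convert back, as in the original code
    end
    return cum
end

accumulate32(1); accumulate64(1)   # warm up to exclude compilation from the timing
t32 = @elapsed accumulate32(10_000)
t64 = @elapsed accumulate64(10_000)
```

For serious measurements, BenchmarkTools.jl’s @btime would give more reliable numbers than a single @elapsed call.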

It looks like rand(::Uniform) from Distributions.jl is slower for two reasons:

  1. It always returns a Float64 even if the endpoints are Float32, which makes both the random-number generation and the arithmetic more expensive than necessary. This explains about 2/3 of the performance difference on my machine (as can be verified by manually implementing a generate_uniform(u::Uniform{T}) where {T} = rand(float(real(T)))*(u.b - u.a) + u.a method and calling that instead of rand). It is a known issue that was reported recently: Inconsistent Type Behavior for Uniform Distribution · Issue #1783 · JuliaStats/Distributions.jl · GitHub

  2. Uniform stores the endpoints a and b, so it has to compute b - a on each call. This explains the remaining 1/3 of the cost difference on my machine.
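The first point can be sketched without Distributions.jl: MyUniform below is a hypothetical stand-in with the same two fields a and b, and the sampler draws directly in the element type T instead of always returning Float64.

```julia
using Random

# Hypothetical stand-in for Distributions.Uniform, just the two endpoint fields
struct MyUniform{T<:Real}
    a::T
    b::T
end

# Type-stable sampler: draws in float(real(T)) directly, so a Float32
# distribution yields Float32 samples with no Float64 round trip
generate_uniform(rng::AbstractRNG, u::MyUniform{T}) where {T} =
    rand(rng, float(real(T))) * (u.b - u.a) + u.a
generate_uniform(u::MyUniform) = generate_uniform(Random.default_rng(), u)
```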


Thanks for the reply!

Yes, I looked around and saw that rand(::Uniform{Float32}) returning a Float64 has indeed been a longstanding problem. Couldn’t this easily be solved in the current implementation by just adding a method rand(rng::AbstractRNG, d::Uniform{T}) where T = d.a + (d.b - d.a) * rand(rng, T)? For the second issue, it might be useful to either store the length instead of the second endpoint, or add an extra field that is computed on construction.
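The cached-length idea could look something like the sketch below. CachedUniform and draw are hypothetical names of mine, not an API from Distributions.jl; the point is just that b - a is computed once at construction, so each sample is a single fused multiply-add in T.

```julia
using Random

struct CachedUniform{T<:AbstractFloat}
    a::T     # left endpoint
    len::T   # cached width b - a, computed once at construction
end
CachedUniform(a::T, b::T) where {T<:AbstractFloat} = CachedUniform{T}(a, b - a)

# Each draw stays in T; muladd lets the compiler emit a fused multiply-add
draw(rng::AbstractRNG, u::CachedUniform{T}) where {T} = muladd(rand(rng, T), u.len, u.a)
draw(u::CachedUniform) = draw(Random.default_rng(), u)
```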

These seem like very obvious fixes to me, so I’m assuming they have already been considered and the situation is a bit more complicated than I think. I’d be interested to know why this hasn’t been ‘fixed’ yet.

The reason is almost invariably lack of time. If you feel like making a PR, I think both the additional method and the cached length would make sense, as the cache could also be used in other methods. Otherwise I can give it a shot later.


I made a PR here.
