Potential issues with implementation of ziggurat algorithm

I’m seeing intermittent failures to finish when repeatedly running a kernel that calls randn many times.
I’m wondering if someone can independently verify this behaviour, so I can decide whether a bug needs to be reported.
Or perhaps I’m misusing the CUDA.jl library?

test code:

using CUDA
using CUDA: i32

function test_baserandn(nsims, res)
    tid = threadIdx().x + (blockIdx().x-1i32) * blockDim().x 
    if tid <= nsims
        X = 0.0
        for mth = 1i32:1200i32
            X += randn()
        end 
        @inbounds res[tid] = X / 1200
    end
    nothing
end
function run_test_baserandn(numsims)
    nthreads = 256
    nblocks = cld(numsims, nthreads)
    res = CuArray{Float64}(zeros(numsims))
    @cuda threads=nthreads blocks=nblocks test_baserandn(Int32(numsims), res)
    println(sum(res)/numsims)
    nothing
end
@time CUDA.@sync run_test_baserandn(100000)

It occasionally runs fine - here’s a profile:

But another run might never finish and I have to force the kernel to exit. The hang appears increasingly likely as the number of normal RVs generated per thread increases.
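
For anyone who wants to check that dependence on the loop length, here is a variant of the same kernel with the number of draws per thread as a parameter (just a rearrangement of the code above; the function names are my own):

using CUDA
using CUDA: i32

function test_baserandn_n(ndraws, nsims, res)
    tid = threadIdx().x + (blockIdx().x-1i32) * blockDim().x
    if tid <= nsims
        X = 0.0
        for mth = 1i32:ndraws
            X += randn()
        end
        @inbounds res[tid] = X / ndraws
    end
    nothing
end
function run_test_baserandn_n(numsims, ndraws)
    nthreads = 256
    nblocks = cld(numsims, nthreads)
    res = CuArray{Float64}(zeros(numsims))
    @cuda threads=nthreads blocks=nblocks test_baserandn_n(Int32(ndraws), Int32(numsims), res)
    println(sum(res)/numsims)
    nothing
end
# sweep the number of draws per thread and see when the hang starts to appear
for n in (100, 1_000, 10_000)
    @time CUDA.@sync run_test_baserandn_n(100_000, n)
end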

I don’t appear to have this problem if I remove @inbounds from the code in the randn_unlikely function:

using Random

@noinline function CUDA.randn_unlikely(rng, idx, rabs, x)
    # @inbounds begin
        if idx == 0
            while true
                xx = -Random.ziggurat_nor_inv_r*log(Random.rand(rng))
                yy = -log(Random.rand(rng))
                yy+yy > xx*xx &&
                    return (rabs >> 8) % Bool ? -Random.ziggurat_nor_r-xx : Random.ziggurat_nor_r+xx
            end
        elseif (CUDA.gpu_fi()[idx] - CUDA.gpu_fi()[idx+1])*Random.rand(rng) + CUDA.gpu_fi()[idx+1] < exp(-0.5*x*x)
            return x # return from the triangular area
        else
            return Random.randn(rng)
        end
    # end
end

or if I replace the final recursive call to Random.randn(rng) with the Box-Muller algorithm.
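
For reference, here is a rough sketch of the kind of Box-Muller substitution I mean (illustrative rather than the exact code I tested; the only change is that the tail branch draws directly via Box-Muller instead of recursing into Random.randn):

using CUDA
using Random

@noinline function CUDA.randn_unlikely(rng, idx, rabs, x)
    if idx == 0
        while true
            xx = -Random.ziggurat_nor_inv_r*log(Random.rand(rng))
            yy = -log(Random.rand(rng))
            yy+yy > xx*xx &&
                return (rabs >> 8) % Bool ? -Random.ziggurat_nor_r-xx : Random.ziggurat_nor_r+xx
        end
    elseif (CUDA.gpu_fi()[idx] - CUDA.gpu_fi()[idx+1])*Random.rand(rng) + CUDA.gpu_fi()[idx+1] < exp(-0.5*x*x)
        return x # return from the triangular area
    else
        # Box-Muller draw instead of the recursive Random.randn(rng) call
        u1 = Random.rand(rng)
        u2 = Random.rand(rng)
        r  = sqrt(-2.0*log1p(-u1))   # log1p(-u1) avoids log(0) when u1 == 0
        return r*cos(2.0*pi*u2)
    end
end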

As an aside, the GPU implementation of randn_unlikely is slightly different from the stdlib CPU version (note the use of log1p and the change of sign of the uniform RVs). Not sure if this is another issue? Stdlib version:

@noinline function randn_unlikely(rng, idx, rabs, x)
    @inbounds if idx == 0
        while true
            xx = -ziggurat_nor_inv_r*log1p(-rand(rng))
            yy = -log1p(-rand(rng))
            yy+yy > xx*xx &&
                return (rabs >> 8) % Bool ? -ziggurat_nor_r-xx : ziggurat_nor_r+xx
        end
    elseif (fi[idx] - fi[idx+1])*rand(rng) + fi[idx+1] < exp(-0.5*x*x)
        return x # return from the triangular area
    else
        return randn(rng)
    end
end
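
As a purely CPU-side illustration of that difference (no CUDA involved, and I haven’t checked whether the GPU generator can actually return exactly 0.0): -log(rand(rng)) blows up to Inf if the uniform draw is exactly 0.0, whereas the stdlib’s -log1p(-rand(rng)) stays finite for every value in [0, 1):

u = 0.0                    # rand() draws from [0, 1), so 0.0 is a possible value
@show -log(u)              # form used in the GPU version: Inf at u == 0
@show -log1p(-u)           # form used in the stdlib version: 0.0, finite on all of [0, 1)

# log1p(-u) is just a more accurate way of computing log(1 - u):
u = 0.25
@show -log(1 - u) ≈ -log1p(-u)   # true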