Potential issues with implementation of ziggurat algorithm

I’m seeing intermittent failures to finish when repeatedly running a kernel that calls randn many times.
I’m wondering if someone can independently verify this behaviour, so I can decide whether a bug needs to be reported.
Or perhaps I’m misusing the CUDA.jl library?

test code:

using CUDA
using CUDA: i32

function test_baserandn(nsims, res)
    tid = threadIdx().x + (blockIdx().x-1i32) * blockDim().x 
    if tid <= nsims
        X = 0.0
        for mth = 1i32:1200i32
            X += randn()
        end 
        @inbounds res[tid] = X / 1200
    end
    nothing
end
function run_test_baserandn(numsims)
    nthreads = 256
    nblocks = cld(numsims, nthreads)
    res = CuArray{Float64}(zeros(numsims))
    @cuda threads=nthreads blocks=nblocks test_baserandn(Int32(numsims), res)
    println(sum(res)/numsims)
    nothing
end
@time CUDA.@sync run_test_baserandn(100000)

It occasionally runs fine - here’s a profile:

But another run might never finish and I have to force the kernel to exit. The hang appears increasingly likely as the number of normal RVs generated per thread increases.
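
For anyone who wants to check that dependence on the loop length, here is a variant of the same kernel with the number of draws per thread as a parameter (just a rearrangement of the code above; the function names are my own):

using CUDA
using CUDA: i32

function test_baserandn_n(ndraws, nsims, res)
    tid = threadIdx().x + (blockIdx().x-1i32) * blockDim().x
    if tid <= nsims
        X = 0.0
        for mth = 1i32:ndraws
            X += randn()
        end
        @inbounds res[tid] = X / ndraws
    end
    nothing
end
function run_test_baserandn_n(numsims, ndraws)
    nthreads = 256
    nblocks = cld(numsims, nthreads)
    res = CuArray{Float64}(zeros(numsims))
    @cuda threads=nthreads blocks=nblocks test_baserandn_n(Int32(ndraws), Int32(numsims), res)
    println(sum(res)/numsims)
    nothing
end
# sweep the number of draws per thread and see when the hang starts to appear
for n in (100, 1_000, 10_000)
    @time CUDA.@sync run_test_baserandn_n(100_000, n)
end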

I don’t appear to have this problem if I remove @inbounds from the code in the randn_unlikely function:

using Random

@noinline function CUDA.randn_unlikely(rng, idx, rabs, x)
    # @inbounds begin
        if idx == 0
            while true
                xx = -Random.ziggurat_nor_inv_r*log(Random.rand(rng))
                yy = -log(Random.rand(rng))
                yy+yy > xx*xx &&
                    return (rabs >> 8) % Bool ? -Random.ziggurat_nor_r-xx : Random.ziggurat_nor_r+xx
            end
        elseif (CUDA.gpu_fi()[idx] - CUDA.gpu_fi()[idx+1])*Random.rand(rng) + CUDA.gpu_fi()[idx+1] < exp(-0.5*x*x)
            return x # return from the triangular area
        else
            return Random.randn(rng)
        end
    # end
end

or if I replace the final recursive call to Random.randn(rng) with the Box-Muller algorithm.
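
For reference, here is a rough sketch of the kind of Box-Muller substitution I mean (illustrative rather than the exact code I tested; the only change is that the tail branch draws directly via Box-Muller instead of recursing into Random.randn):

using CUDA
using Random

@noinline function CUDA.randn_unlikely(rng, idx, rabs, x)
    if idx == 0
        while true
            xx = -Random.ziggurat_nor_inv_r*log(Random.rand(rng))
            yy = -log(Random.rand(rng))
            yy+yy > xx*xx &&
                return (rabs >> 8) % Bool ? -Random.ziggurat_nor_r-xx : Random.ziggurat_nor_r+xx
        end
    elseif (CUDA.gpu_fi()[idx] - CUDA.gpu_fi()[idx+1])*Random.rand(rng) + CUDA.gpu_fi()[idx+1] < exp(-0.5*x*x)
        return x # return from the triangular area
    else
        # Box-Muller draw instead of the recursive Random.randn(rng) call
        u1 = Random.rand(rng)
        u2 = Random.rand(rng)
        r  = sqrt(-2.0*log1p(-u1))   # log1p(-u1) avoids log(0) when u1 == 0
        return r*cos(2.0*pi*u2)
    end
end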

As an aside, the GPU implementation of randn_unlikely is slightly different from the stdlib CPU version (note the use of log1p and the change of sign of the uniform RVs). Not sure if this is another issue? Stdlib version:

@noinline function randn_unlikely(rng, idx, rabs, x)
    @inbounds if idx == 0
        while true
            xx = -ziggurat_nor_inv_r*log1p(-rand(rng))
            yy = -log1p(-rand(rng))
            yy+yy > xx*xx &&
                return (rabs >> 8) % Bool ? -ziggurat_nor_r-xx : ziggurat_nor_r+xx
        end
    elseif (fi[idx] - fi[idx+1])*rand(rng) + fi[idx+1] < exp(-0.5*x*x)
        return x # return from the triangular area
    else
        return randn(rng)
    end
end
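
As a purely CPU-side illustration of that difference (no CUDA involved, and I haven’t checked whether the GPU generator can actually return exactly 0.0): -log(rand(rng)) blows up to Inf if the uniform draw is exactly 0.0, whereas the stdlib’s -log1p(-rand(rng)) stays finite for every value in [0, 1):

u = 0.0                    # rand() draws from [0, 1), so 0.0 is a possible value
@show -log(u)              # form used in the GPU version: Inf at u == 0
@show -log1p(-u)           # form used in the stdlib version: 0.0, finite on all of [0, 1)

# log1p(-u) is just a more accurate way of computing log(1 - u):
u = 0.25
@show -log(1 - u) ≈ -log1p(-u)   # true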