Strange issues with stride loops as in CUDA.Random

Hello everyone,
I’ve been spending quite some time debugging my method to sample gamma-distributed random numbers on my GPU.
The algorithm is quite simple: since my shape parameter k is drawn from a Poisson distribution and is therefore an integer, it should suffice to sum the negative logs of k unif(0,1) random numbers:

function gamma_rho_dep!(rng::CUDA.RNG, rho)

    # The GPU kernel
    function kernel!(rho, seed::UInt32, counter::UInt32)
        device_rng = Random.default_rng()
        # initialize the state
        @inbounds Random.seed!(device_rng, seed, counter)


        # grid-stride loop
        tid    = threadIdx().x
        window = (blockDim().x - 1i32) * gridDim().x
        offset = (blockIdx().x - 1i32) * blockDim().x

        while offset < length(rho)
            
            i = tid + offset
            if i <= length(rho)
                @inbounds k = round(UInt32,rho[i])
                # if k != 1
                #     @cuprintln(i, ";", k)
                # end
                @inbounds rho[i] = 0.0f0
                for j=1:k
                    @inbounds rho[i] -= CUDA.log(Random.rand(device_rng, FloatType))
                end
            end
            offset += window
        end
        return nothing
    end

    kernel = @cuda launch=false kernel!(rho, rng.seed, rng.counter)
    config = CUDA.launch_configuration(kernel.fun; max_threads=64)
    threads = max(32, min(length(rho),config.threads))
    blocks = min(config.blocks,cld(length(rho), threads))
    CUDA.@sync kernel(rho, rng.seed, rng.counter; threads, blocks)

    new_counter = Int64(rng.counter) + length(rho)
    overflow, remainder = fldmod(new_counter, typemax(UInt32))
    rng.seed += overflow     
    rng.counter = remainder
    rho
end
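
For reference, this is the sampling scheme the kernel is meant to implement, written as a minimal CPU sketch (the helper name and the plain Vector{Float32} input are just for illustration):

# Minimal CPU sketch: for integer shape k, a Gamma(k, 1) variate is the sum of
# k independent Exp(1) variates, each obtained as -log(U) with U ~ unif(0,1).
function gamma_rho_dep_cpu!(rho::Vector{Float32})
    for i in eachindex(rho)
        k = round(UInt32, rho[i])      # shape parameter, stored as Float32
        acc = 0.0f0
        for _ in 1:k
            acc -= log(rand(Float32))  # add one Exp(1) sample
        end
        rho[i] = acc                   # overwrite the shape with the sample
    end
    return rho
end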

The shape parameter is supposed to be different for each element of the rho array, which is why I wrote this kernel for it. Things get a bit weird because the integers arrive as Float32, for the simple reason that I do not want two gigantic arrays of different types - maybe I need to re-cast the CuArray to a different type? But I do not expect any real issues to arise from this, as slight fluctuations around the integer values should not be noticed by the subsequent round(UInt32), right?
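
At least on the CPU, round(UInt32, x) is robust to small perturbations around an integer, so the Float32 storage by itself should be harmless - a quick sanity check:

julia> round(UInt32, 0.999f0), round(UInt32, 1.0f0), round(UInt32, 1.001f0)
(0x00000001, 0x00000001, 0x00000001)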
Now, when testing this, I first invoked it with the simplest input possible - CUDA.ones(Float32, n). This should reproduce a Gamma distribution with shape parameter 1 (which should be an exponential distribution?). It does, except I also get a huge peak of zeros. I tracked this down to the fact that the round() call returns something other than one. It gets worse - it only happens on arrays of size 10000 and upwards, for about 6% of the elements, and only at the end of the array! Not every value towards the end is wrong, however… What is happening here???
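
A quick way to see how many elements are affected and where they sit (assuming CUDA.default_rng() returns a CUDA.RNG, as on recent CUDA.jl versions):

julia> rho = CUDA.ones(Float32, 100_000);

julia> gamma_rho_dep!(CUDA.default_rng(), rho);

julia> bad = findall(iszero, Array(rho));   # exact zeros essentially only come from k == 0

julia> length(bad), isempty(bad) ? 0 : first(bad)   # how many, and where they start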

Also, this works:

function test_ones(len::Integer)

    # The GPU kernel
    function kernel!(A, B)

        # grid-stride loop
        tid    = threadIdx().x
        window = (blockDim().x - 1i32) * gridDim().x
        offset = (blockIdx().x - 1i32) * blockDim().x

        while offset < length(A)
            i = tid + offset
            if i <= length(A)
                @inbounds B[i] = round(UInt32,A[i])
            end
            offset += window
        end
        return nothing
    end
    
    CUDA.@sync a = CUDA.ones(FloatType, len)
    b = CuArray{UInt32}(undef, len)
    kernel = @cuda launch=false kernel!(a, b)
    config = CUDA.launch_configuration(kernel.fun; max_threads=64)
    threads = max(32, min(length(a),config.threads))
    blocks = cld(length(a), threads)
    CUDA.@sync kernel(a, b; threads, blocks)
    b

end

It differs only in that the rounded numbers are written to an array of UInt32s instead of the (hopefully thread-local) variable k…

EDIT:
Somehow, the array elements seem to be visited twice, but I don’t really see how. The grid-stride loop must be wrong, but I shamelessly stole it from CUDA.jl, specifically the Random subpackage.
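
One way to see whether that can happen with this index arithmetic is to replay it on the CPU (a sketch with made-up launch dimensions, not the ones the launch configuration actually picks):

# Replay the kernel's index pattern on the host and count how often each
# element is visited (hypothetical launch: 64 threads per block, 4 blocks, length 1000).
function visited_counts(len, blockdim, griddim)
    counts = zeros(Int, len)
    for block in 1:griddim, tid in 1:blockdim
        window = (blockdim - 1) * griddim      # the stride used in the kernel above
        offset = (block - 1) * blockdim
        while offset < len
            i = tid + offset
            if i <= len
                counts[i] += 1
            end
            offset += window
        end
    end
    return counts
end

count(>(1), visited_counts(1000, 64, 4))   # nonzero: some elements are visited twice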

I have also been looking at this issue, but since the problem occurs from my usual terminal as well, I suspect something different is going on.

My versions:

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_EDITOR = code

julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
NVIDIA driver 470.129.6, for CUDA 11.4
CUDA driver 11.7

Libraries: 
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+470.129.6
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.7.0
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: NVIDIA T1000 (sm_75, 2.645 GiB / 3.817 GiB available)

Tracked it down…

        # grid-stride loop
        tid    = threadIdx().x
        window = (blockDim().x - 1i32) * gridDim().x
        offset = (blockIdx().x - 1i32) * blockDim().x

Subtracting the one from blockDim().x when computing the stride seems to be wrong.
The thing is that I adapted the code from BinomialGPU.jl, which in turn copied it from CUDA.jl. Can anybody confirm that this is a bug in those packages? Or is something else wrong?
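
For comparison, what I would have expected is the textbook grid-stride pattern, where the stride is the total number of threads in the grid:

        # grid-stride loop (stride = total number of threads in the grid)
        tid    = threadIdx().x
        window = blockDim().x * gridDim().x
        offset = (blockIdx().x - 1i32) * blockDim().x

In other words, the - 1i32 belongs only in the offset computation, not in the stride.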