Strange issues with stride loops as in CUDA.Random

Hello everyone,
I’ve been spending quite some time debugging my method to sample gamma-distributed random numbers on my GPU.
The algorithm is quite simple: since my shape parameter k is drawn from a Poisson distribution and is therefore an integer, it should suffice to sum the negative logs of k unif(0,1) random numbers:

function gamma_rho_dep!(rng::CUDA.RNG, rho)

    # The GPU kernel
    function kernel!(rho, seed::UInt32, counter::UInt32)
        device_rng = Random.default_rng()
        # initialize the state
        @inbounds Random.seed!(device_rng, seed, counter)


        # grid-stride loop
        tid    = threadIdx().x
        window = (blockDim().x - 1i32) * gridDim().x
        offset = (blockIdx().x - 1i32) * blockDim().x

        while offset < length(rho)
            
            i = tid + offset
            if i <= length(rho)
                @inbounds k = round(UInt32,rho[i])
                # if k != 1
                #     @cuprintln(i, ";", k)
                # end
                @inbounds rho[i] = 0.0f0
                for j=1:k
                    @inbounds rho[i] -= CUDA.log(Random.rand(device_rng, FloatType))
                end
            end
            offset += window
        end
        return nothing
    end

    kernel = @cuda launch=false kernel!(rho, rng.seed, rng.counter)
    config = CUDA.launch_configuration(kernel.fun; max_threads=64)
    threads = max(32, min(length(rho),config.threads))
    blocks = min(config.blocks,cld(length(rho), threads))
    CUDA.@sync kernel(rho, rng.seed, rng.counter; threads, blocks)

    new_counter = Int64(rng.counter) + length(rho)
    overflow, remainder = fldmod(new_counter, typemax(UInt32))
    rng.seed += overflow     
    rng.counter = remainder
    rho
end
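
For reference, this is the sampling scheme the kernel is meant to implement, written as a minimal CPU sketch (the helper name and the plain Vector{Float32} input are just for illustration):

# Minimal CPU sketch: for integer shape k, a Gamma(k, 1) variate is the sum of
# k independent Exp(1) variates, each obtained as -log(U) with U ~ unif(0,1).
function gamma_rho_dep_cpu!(rho::Vector{Float32})
    for i in eachindex(rho)
        k = round(UInt32, rho[i])      # shape parameter, stored as Float32
        acc = 0.0f0
        for _ in 1:k
            acc -= log(rand(Float32))  # add one Exp(1) sample
        end
        rho[i] = acc                   # overwrite the shape with the sample
    end
    return rho
end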

The shape parameter is supposed to be different for each element of the rho array, which is why I wrote this kernel for it. Things get a bit weird because the integers arrive as Float32, for the simple reason that I do not want two gigantic arrays of different types - maybe I need to re-cast the CuArray to a different type? But I do not expect any real issues to arise from this, as slight fluctuations around the integer values should not be noticed by the subsequent round(UInt32), right?
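
At least on the CPU, round(UInt32, x) is robust to small perturbations around an integer, so the Float32 storage by itself should be harmless - a quick sanity check:

julia> round(UInt32, 0.999f0), round(UInt32, 1.0f0), round(UInt32, 1.001f0)
(0x00000001, 0x00000001, 0x00000001)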
Now, when testing this, I first invoked it with the simplest input possible - CUDA.ones(Float32, n). This should reproduce a Gamma distribution with shape parameter 1 (which should be an exponential distribution?). It does, except I also get a huge peak of zeros. I tracked this down to the fact that the round() call returns something other than one. It gets worse - it only happens on arrays of size 10000 and upwards, for about 6% of the elements, and only at the end of the array! Not every value towards the end is wrong, however… What is happening here???
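
A quick way to see how many elements are affected and where they sit (assuming CUDA.default_rng() returns a CUDA.RNG, as on recent CUDA.jl versions):

julia> rho = CUDA.ones(Float32, 100_000);

julia> gamma_rho_dep!(CUDA.default_rng(), rho);

julia> bad = findall(iszero, Array(rho));   # exact zeros essentially only come from k == 0

julia> length(bad), isempty(bad) ? 0 : first(bad)   # how many, and where they start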

Also, this works:

function test_ones(len::Integer)

    # The GPU kernel
    function kernel!(A, B)

        # grid-stride loop
        tid    = threadIdx().x
        window = (blockDim().x - 1i32) * gridDim().x
        offset = (blockIdx().x - 1i32) * blockDim().x

        while offset < length(A)
            i = tid + offset
            if i <= length(A)
                @inbounds B[i] = round(UInt32,A[i])
            end
            offset += window
        end
        return nothing
    end
    
    CUDA.@sync a = CUDA.ones(FloatType, len)
    b = CuArray{UInt32}(undef, len)
    kernel = @cuda launch=false kernel!(a, b)
    config = CUDA.launch_configuration(kernel.fun; max_threads=64)
    threads = max(32, min(length(a),config.threads))
    blocks = cld(length(a), threads)
    CUDA.@sync kernel(a, b; threads, blocks)
    b

end

It differs only in that the rounded numbers are written to an array of UInt32s instead of the (hopefully thread-local) variable k…

EDIT:
Somehow, the array elements seem to be visited twice, but I don’t really see how. The grid-stride loop must be wrong, but I shamelessly stole it from CUDA.jl, specifically the Random subpackage.
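
One way to see whether that can happen with this index arithmetic is to replay it on the CPU (a sketch with made-up launch dimensions, not the ones the launch configuration actually picks):

# Replay the kernel's index pattern on the host and count how often each
# element is visited (hypothetical launch: 64 threads per block, 4 blocks, length 1000).
function visited_counts(len, blockdim, griddim)
    counts = zeros(Int, len)
    for block in 1:griddim, tid in 1:blockdim
        window = (blockdim - 1) * griddim      # the stride used in the kernel above
        offset = (block - 1) * blockdim
        while offset < len
            i = tid + offset
            if i <= len
                counts[i] += 1
            end
            offset += window
        end
    end
    return counts
end

count(>(1), visited_counts(1000, 64, 4))   # nonzero: some elements are visited twice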

I have also been looking at this issue, but since the problem occurs from my usual terminal as well, I suspect something different is going on.

My versions:

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_EDITOR = code

julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
NVIDIA driver 470.129.6, for CUDA 11.4
CUDA driver 11.7

Libraries: 
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+470.129.6
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.7.0
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: NVIDIA T1000 (sm_75, 2.645 GiB / 3.817 GiB available)

Tracked it down…

        # grid-stride loop
        tid    = threadIdx().x
        window = (blockDim().x - 1i32) * gridDim().x
        offset = (blockIdx().x - 1i32) * blockDim().x

Subtracting the one from blockDim().x when computing the stride seems to be wrong.
The thing is that I adapted the code from BinomialGPU.jl, which in turn copied it from CUDA.jl. Can anybody confirm that this is a bug in those packages? Or is something else wrong?
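
For comparison, what I would have expected is the textbook grid-stride pattern, where the stride is the total number of threads in the grid:

        # grid-stride loop (stride = total number of threads in the grid)
        tid    = threadIdx().x
        window = blockDim().x * gridDim().x
        offset = (blockIdx().x - 1i32) * blockDim().x

In other words, the - 1i32 belongs only in the offset computation, not in the stride.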