Understanding random numbers in a GPU kernel

Hi All,

I’m trying to understand how to use random numbers inside CUDA kernels in Julia. Below is a simple script to estimate pi, but I have a few questions regarding its implementation.

  1. Is my use of random numbers inside mc_pi_kernel! even correct? The code runs, but if I instead call x = CUDA.rand(Float32, 1), it crashes with an unsupported call error, and I’m not sure why.
  2. I’ve tried using the Random123.jl library for a counter-based RNG, but this fails with an unsupported dynamic function invocation error.

What am I doing wrong here?

Here’s a minimal reproducible example:

using CUDA
using Random123 # CUDA counter-based PRNG? 

function mc_pi_kernel!(results::CuDeviceVector{Int32}, N::Int)
    tid = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if tid < 1 || tid > N
        return
    end

    x = rand(Float32) # Does this call CPU rand host-side? 
    y = rand(Float32)

    results[tid] = (x*x + y*y <= 1.0f0) ? 1 : 0

    return
end

function estimate_pi(N::Int=10^6)
    d_results = CuArray(zeros(Int32, N))

    # Get configuration threads/blocks 
    kernel = @cuda launch=false mc_pi_kernel!(d_results, N)
    config = launch_configuration(kernel.fun)
    threads = min(N, config.threads)
    blocks = cld(N, threads)
    println("Estimating π with $N samples ($blocks blocks, $threads threads)")

    @cuda always_inline=true threads=threads blocks=blocks mc_pi_kernel!(d_results, N)

    inside = sum(Array(d_results))
    return 4 * inside / N
end

N = 10^9
pi_est = estimate_pi(N)
println("Estimated π with $N samples = $pi_est")

Hi,

Yes, this looks fine. It might be slightly better to use 32-bit literals (1i32, after using CUDA: i32), and ifelse might potentially be faster than the ternary operator. But I can’t measure any difference, so it probably doesn’t really matter here.
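For concreteness, the suggested tweak would look like the sketch below. Inside the kernel you’d write 1i32 / 0i32 (which needs using CUDA: i32); here Int32(...) stands in for those literals so the snippet runs on the CPU. ifelse evaluates both arguments and selects one without a branch, which is why it can sometimes beat the ternary operator on the GPU:

```julia
# CPU-runnable sketch; in the kernel this line would read
#   results[tid] = ifelse(x*x + y*y <= 1.0f0, 1i32, 0i32)
x, y = 0.3f0, 0.4f0
inside = ifelse(x * x + y * y <= 1.0f0, Int32(1), Int32(0))
println(inside)  # prints 1, since 0.09 + 0.16 <= 1
```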

The reason CUDA.rand(Float32, 1) crashes is that it would create a CuVector, i.e. allocate memory, which is not allowed inside kernels (or at least not in this manner). While these error messages are often hard to interpret, this one does explicitly mention allocating memory:

ERROR: InvalidIRError: (...)
Reason: unsupported call to an unknown function (call to jl_alloc_genericmemory)
Stacktrace:
 [1] GenericMemory

In contrast, rand(Float32) returns a simple scalar. Inside a kernel, this will automatically reside on the device.
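The distinction is visible even on the host (a sketch; the CUDA.rand call itself needs a GPU, so it only appears in a comment): rand(Float32) produces a plain isbits Float32 scalar, which is exactly why it works inside a kernel.

```julia
x = rand(Float32)
println(x isa Float32)  # true: a plain scalar, no allocation involved
# By contrast, CUDA.rand(Float32, 1) returns a 1-element CuVector,
# i.e. it allocates device memory, which a kernel may not do.
```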

As for Random123.jl, I’m not familiar with that library, but presumably it allocates, is type-unstable, or uses non-isbits structs that have not been adapted for the GPU.

By the way, in inside = sum(Array(d_results)) you can just use sum(d_results), which will perform the summation on the GPU and return the scalar result on the CPU.


Thanks for the detailed explanation @eldee!

If anyone else knows how to use the Random123.jl library for counter-based RNGs on the GPU in Julia, do let me know!