I’m trying to understand the use of random numbers inside CUDA kernels in Julia. Below I have a simple script to estimate pi, but I have a few questions regarding its implementation.
Is my use of random numbers inside mc_pi_kernel! even correct? The code runs, but if I instead call x = CUDA.rand(Float32, 1) inside the kernel, it crashes with an unsupported call error and I’m not sure why.
I’ve tried using the Random123.jl library for a counter-based RNG, but this also fails with an unsupported dynamic function invocation error.
What am I doing wrong here?
Here’s a minimal reproducible example:
using CUDA
using Random123 # CUDA counter-based PRNG?

function mc_pi_kernel!(results::CuDeviceVector{Int32}, N::Int)
    tid = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if tid < 1 || tid > N
        return
    end
    x = rand(Float32) # Does this call CPU rand host-side?
    y = rand(Float32)
    results[tid] = (x*x + y*y <= 1.0f0) ? 1 : 0
    return
end
function estimate_pi(N::Int=10^6)
    d_results = CuArray(zeros(Int32, N))
    # Get launch configuration (threads/blocks)
    kernel = @cuda launch=false mc_pi_kernel!(d_results, N)
    config = launch_configuration(kernel.fun)
    threads = min(N, config.threads)
    blocks = cld(N, threads)
    println("Estimating π with $N samples ($blocks blocks, $threads threads)")
    @cuda always_inline=true threads=threads blocks=blocks mc_pi_kernel!(d_results, N)
    inside = sum(Array(d_results))
    return 4 * inside / N
end
N = 10^9
pi_est = estimate_pi(N)
println("Estimated π with $N samples = $pi_est")
Yes, this looks fine. It might be slightly better to use 32-bit literals (1i32, after using CUDA: i32), and ifelse might potentially be faster than the ternary operator. But I can’t measure any difference, so it probably doesn’t really matter here.
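For concreteness, a sketch of the kernel with those two tweaks (i32 here is the Int32 literal suffix that CUDA.jl exports):

using CUDA: i32

function mc_pi_kernel!(results::CuDeviceVector{Int32}, N::Int)
    tid = threadIdx().x + (blockIdx().x - 1i32) * blockDim().x
    if tid < 1i32 || tid > N
        return
    end
    x = rand(Float32)
    y = rand(Float32)
    # ifelse evaluates both arguments eagerly, which lets the compiler avoid a branch
    results[tid] = ifelse(x*x + y*y <= 1.0f0, 1i32, 0i32)
    return
end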
The reason is that CUDA.rand(Float32, 1) would create a CuVector, i.e. allocate memory, which is not allowed inside kernels (at least not in this manner). While these error messages are often hard to interpret, here it does explicitly mention allocating memory:
ERROR: InvalidIRError: (...)
Reason: unsupported call to an unknown function (call to jl_alloc_genericmemory)
Stacktrace:
[1] GenericMemory
In contrast, rand(Float32) returns a simple scalar. Inside a kernel, this will automatically reside on the device.
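As an aside, host-side CUDA.rand plus broadcasting can solve this particular problem without a hand-written kernel at all. A minimal sketch (estimate_pi_broadcast is just an illustrative name):

using CUDA

function estimate_pi_broadcast(N)
    x = CUDA.rand(Float32, N)  # allocating is fine here: we are on the host
    y = CUDA.rand(Float32, N)
    # the comparison broadcasts on the GPU; sum of the Bool array reduces there too
    return 4 * sum(x .* x .+ y .* y .<= 1f0) / N
end

Note that this allocates two N-element arrays, so for N = 10^9 (8 GB of Float32) you would want to process the samples in batches.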
I’m not familiar with this library, but presumably it allocates, is type-unstable, or uses non-isbits structs which are not adapted for the GPU.
By the way, in

inside = sum(Array(d_results))

you can just use sum(d_results), which will then perform the summation on the GPU and return the scalar result to the CPU.
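That is, the change is just:

inside = sum(d_results)  # reduces on the GPU; only the scalar travels to the host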
I think the main issue is that something like Philox2x is a mutable struct (hence allocates when you create it).
You could probably rewrite the Random123.jl code to make it immutable, replacing all mutating functions on the path of rand by versions that (also) return a new Philox2x. In particular, you would then need to write r = Philox2x(); x, r = rand(r). You should also make sure that every thread uses a different seed. But that all sounds like more work than it’s worth.
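To illustrate the functional style I mean (this is not Random123.jl’s actual API; ImmutablePhilox and next are made-up names, and the mixing function is a splitmix64-style stand-in, not a real Philox round):

struct ImmutablePhilox   # isbits, so safe to use inside a kernel
    key::UInt64
    ctr::UInt64
end

@inline function next(rng::ImmutablePhilox)
    # splitmix64-style mixer as a placeholder for the real Philox rounds
    z = (rng.ctr + rng.key) * 0x9e3779b97f4a7c15
    z = (z ⊻ (z >> 31)) * 0xbf58476d1ce4e5b9
    x = Float32(z >> 40) * Float32(2.0^-24)          # uniform in [0, 1)
    return x, ImmutablePhilox(rng.key, rng.ctr + 1)  # new state instead of mutation
end

# In the kernel, seed each thread with its own key, e.g.:
# rng = ImmutablePhilox(UInt64(tid), UInt64(0))
# x, rng = next(rng)
# y, rng = next(rng)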