I suppose people would be interested in this for reproducibility in algorithms that comprise multiple kernel invocations. Would it be possible to hold the seed and counter in a task-local host-side state so that the device-side RNG by default works the way it currently does when used through the host-side wrapper? (I seem to recall that CUDA.jl identifies CUDA streams with Julia tasks, so by making this state task-local you won’t have to deal with race conditions from launching kernels on parallel CUDA streams.)
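To illustrate what I have in mind, here's a rough sketch of such task-local state using Base's task_local_storage. To be clear, the names DeviceRNGState and device_rng_state are made up, not CUDA.jl API:

using Random

# Hypothetical sketch, not CUDA.jl API: one seed/counter pair per Julia task,
# so kernels launched from concurrent tasks (and hence on different streams)
# don't race on shared RNG state.
mutable struct DeviceRNGState
    seed::UInt32
    counter::UInt32
end

function device_rng_state()
    get!(task_local_storage(), :device_rng_state) do
        DeviceRNGState(rand(UInt32), UInt32(0))
    end::DeviceRNGState
end

# The host-side wrapper could then pass state.seed and state.counter to each
# kernel launch and advance the counter afterwards.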
Though on the point of reproducibility, is the mapping of the random number stream to threads/lanes deterministic? I.e., will a kernel where I set the seed and counter at the top and then do random stuff always produce the same results (for a given block/thread structure)?
julia> using CUDA, Random

julia> function kernel_myrand!(x)
           Random.seed!(93684) # NEVER DO THIS IF THE KERNEL IS CALLED MORE THAN ONCE
           index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
           stride = gridDim().x * blockDim().x
           for i = index:stride:length(x)
               x[i] = rand(Float32)
           end
           return nothing
       end;

julia> begin # same result every time
           x, y = CUDA.zeros(2^16), CUDA.zeros(2^16)
           CUDA.@sync begin
               @cuda threads=128 blocks=8 kernel_myrand!(x)
               @cuda threads=128 blocks=8 kernel_myrand!(y)
           end
           x == y
       end
true
For the counter, that’s hard, because it would imply that we have to read back the value from the kernel after it finishes (which requires synchronization), since we don’t know how many times rand() was called. Furthermore, the counter could differ between warps if rand() wasn’t called an equal number of times across threads (although I’m not sure the implementation is robust enough to allow that), so we would have to read back the counter for every warp.
The alternative is to store the per-warp counters in global memory instead, but that would lower performance, as global memory accesses are synchronizing. We currently store them in shared memory to avoid such synchronization.
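For illustration only (this is a toy kernel, not the actual CUDA.jl RNG implementation), keeping one counter per warp in shared memory could look like the sketch below; note that inspecting those counters on the host still forces every warp to write its value out to global memory and forces the host to synchronize, which is exactly the read-back cost described above:

using CUDA

# Toy illustration only, not the actual CUDA.jl RNG code.
function kernel_warp_counters!(out)
    # one counter per warp, in shared memory (assumes at most 32 warps per block)
    counters = CuStaticSharedArray(UInt32, 32)
    warp = (threadIdx().x - 1) ÷ warpsize() + 1
    lane = (threadIdx().x - 1) % warpsize() + 1
    lane == 1 && (counters[warp] = 0)
    sync_threads()

    # pretend each warp "draws" a different number of values
    for _ in 1:warp
        lane == 1 && (counters[warp] += 1)
    end
    sync_threads()

    # reading the counters back on the host requires writing them to global
    # memory here, and synchronizing on the host before inspecting them
    if lane == 1
        out[(blockIdx().x - 1) * cld(blockDim().x, warpsize()) + warp] = counters[warp]
    end
    return nothing
end

counters = CUDA.zeros(UInt32, 8 * 4)  # 8 blocks × 4 warps of 32 threads
CUDA.@sync @cuda threads=128 blocks=8 kernel_warp_counters!(counters)
Array(counters)  # per-warp counters, available only after synchronizing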
Thanks for explaining! Also, I get it now: if you set the seed of the CPU RNG at the top of your code, the random numbers are indeed reproducible as long as both the CPU and GPU code stay the same, even though they don’t correspond to a single coherent stream of the device RNG.
Maybe this deserves a paragraph of documentation somewhere, explaining that the way to ensure end-to-end reproducibility is simply to seed the CPU RNG as usual once at the top of your program (or once for each task if you’re using host-side concurrency), and that seeding within kernels is something you probably shouldn’t worry about unless you know exactly what you’re doing.
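A minimal sketch of that recommendation, assuming (per the discussion above) that seeding the CPU RNG once is what makes the GPU results reproducible across runs:

using CUDA, Random

Random.seed!(1234)   # seed the CPU RNG once, at the top of the program

function kernel_rand!(x)   # note: no Random.seed! inside the kernel
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i = index:stride:length(x)
        x[i] = rand(Float32)
    end
    return nothing
end

x = CUDA.zeros(2^16)
CUDA.@sync @cuda threads=128 blocks=8 kernel_rand!(x)
# Re-running this whole script from scratch should then reproduce the same
# contents of x every time (per the discussion above).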
Thank you for your help, especially @maleadt and @danielwe for all the input about how RNG seeding works; it was both thorough and enlightening!
I would also like to express my sincere appreciation for the outstanding dedication and effort that @maleadt puts into developing and enhancing CUDA.jl. Your tireless commitment and prompt responses to community posts are invaluable and greatly benefit the community as a whole.
Keep up your exceptional work and unwavering support!
Based on some of the questions in this thread, I added some documentation on kernel programming (long overdue…), including a short section on device-side RNG usage: Kernel programming · CUDA.jl. Feel free to edit if it can be improved.