I’ve just release CUDA.jl v3.0, a slightly-breaking release with a lot of new features. You can read all about it on the JuliaGPU blog: CUDA.jl 3.0 ⋅ JuliaGPU.
A summary of the new features:
task-based concurrency: it is now possible to perform independent operations (or use different devices) on different Julia tasks, and expect the execution of those tasks to overlap.
memory allocator improvements: on recent NVIDIA drivers supporting CUDA 11.2, we now use the CUDA stream-ordered allocator instead of caching memory. This reduces memory pressure.
device overrides: it is no longer required to use CUDA-specific versions of incompatible methods like Base.sin, instead GPUCompiler.jl can override these automatically.
device-side random numbers: you can now call rand() in a kernel. The device-side RNG is pretty fast, but quality of generated numbers is subpar (help is wanted here).
revamped CUDNN interface: the library wrappers have been completely reworked to make it easier to use advanced features. The high-level wrappers have been moved to the NNlib.jl repo.
No, it’s hand-rolled (hence the quality issues) because we want a fully-generic fallback. Furthermore, we can’t easily use the CURAND device API since it’s exposed via C++ headers.
No, it’s hand-rolled (hence the quality issues) because we want a fully-generic fallback.
We do have a pure-Julia implementation of Philox in Random123.jl. Could we run (a modified version) of that, maybe? Then we could also offer the same RNG on GPU and CPU. It’s a counter-based RNG, so fully parallel.
The tricky part though is that not every thread can afford to have its own RNG state
But with a counter-based RNG, the state is effectively just the seed and the current counter value, so 2+4 64-bit UInts, in the case of Philox4x. And one can partition the counter space between the threads, so everything can be initialized from a one common seed. The threads then just start in different points in the vast (4x 64-bit) counter space (by block/thread/…-id, so that scheduling order won’t matter).
That’s still too much state for each thread (if you store it globally you get synchronizing memory accesses, and with block-shared memory we can’t afford to store that much). But since it’s counter based the thread ID within the block could be used to skip-ahead and only store a single thread’s worth of state, right? So then it should be possible to use this, I’ll have a look!
But since it’s counter based the thread ID within the block could be used to skip-ahead and only store a single thread’s worth of state, right?
Yes, indeed - knowing only the seed, it’d ID and the number of random numbers it has produced already (so a thread-specific counter), the state of the counter-based RNG (e.g. Philox) could be produced on-the-fly. From what I’ve seen in the commit you linked, this is similar to what you do now.
Philox&friends are well tested (nice overview in https://arxiv.org/pdf/1204.6193.pdf) and provide very high quality random number (they actually perform better than Mersenne Twister in some testing categories).
Very excited to see that one can call rand inside a CUDA kernel now, thanks for the great work!
I have been using Philox in curand to do large scale Brownian dynamics simulations. If we could have similar RNGs in CUDA.jl, I would be able to do everything entirely in julia.