CUDA.jl v3.0

Hi all,

I’ve just released CUDA.jl v3.0, a slightly-breaking release with a lot of new features. You can read all about it on the JuliaGPU blog: CUDA.jl 3.0 ⋅ JuliaGPU.

A summary of the new features:

  • task-based concurrency: it is now possible to perform independent operations (or use different devices) on different Julia tasks, and expect the execution of those tasks to overlap (see the sketch after this list).
  • memory allocator improvements: on recent NVIDIA drivers supporting CUDA 11.2, we now use the CUDA stream-ordered allocator instead of our own caching allocator. This reduces memory pressure.
  • device overrides: it is no longer required to use CUDA-specific versions of incompatible methods like Base.sin; instead, GPUCompiler.jl can override these automatically.
  • device-side random numbers: you can now call rand() in a kernel (also shown in the sketch below). The device-side RNG is pretty fast, but the quality of the generated numbers is subpar (help is wanted here).
  • revamped CUDNN interface: the library wrappers have been completely reworked to make it easier to use advanced features. The high-level wrappers have been moved to the NNlib.jl repo.
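
To make the first and fourth items concrete, here is a minimal sketch (the kernel, names, and sizes are just illustrative, not taken from the release notes): a kernel that uses the new device-side rand(), launched for two independent arrays from two Julia tasks so the work can overlap on separate streams.

```julia
using CUDA

# a kernel that fills an array with device-side random numbers
function fill_rand!(A)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(A)
        @inbounds A[i] = rand()   # device-side RNG
    end
    return
end

A = CUDA.zeros(Float32, 2^20)
B = CUDA.zeros(Float32, 2^20)

# each Julia task gets its own stream, so these launches can overlap
t1 = @async begin
    @cuda threads=256 blocks=cld(length(A), 256) fill_rand!(A)
    synchronize()   # synchronizes this task's stream
end
t2 = @async begin
    @cuda threads=256 blocks=cld(length(B), 256) fill_rand!(B)
    synchronize()
end
wait(t1); wait(t2)
```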

hooray! thank you!


Which RNG algorithm is that currently? Something from cuRAND? I think there’s Philox4x in there; that one has a good reputation.

No, it’s hand-rolled (hence the quality issues) because we want a fully-generic fallback. Furthermore, we can’t easily use the CURAND device API since it’s exposed via C++ headers.

Thanks for a great release! The memory allocator improvements result in a 20-30% speedup for AlphaZero.jl on the Connect Four benchmark. :slight_smile:


No, it’s hand-rolled (hence the quality issues) because we want a fully-generic fallback.

We do have a pure-Julia implementation of Philox in Random123.jl. Could we run (a modified version of) that, maybe? Then we could also offer the same RNG on GPU and CPU. It’s a counter-based RNG, so fully parallel.
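
For reference, here is what the CPU side of that could look like today (a minimal sketch, assuming Random123.jl is installed and using its exported Philox4x default constructor):

```julia
using Random123

rng = Philox4x()          # counter-based Philox generator, usable like any AbstractRNG
rand(rng)                 # a single draw
rand(rng, Float32, 4)     # vectors of Float32, etc.
```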

Update: Fixed wrong link to Random123.jl

The tricky part, though, is that not every thread can afford to have its own RNG state, and if you put the state in block-shared memory you need to take care that the order in which threads are scheduled within a block does not affect the numbers that are generated (from a single thread’s point of view, that is). See https://github.com/JuliaGPU/CUDA.jl/blob/27520e60d65e08e50e8c6ce30b2ec322d0fdecb8/src/device/random.jl#L71-L77 and https://github.com/JuliaGPU/CUDA.jl/blob/27520e60d65e08e50e8c6ce30b2ec322d0fdecb8/src/device/random.jl#L161-L224.

The tricky part though is that not every thread can afford to have its own RNG state

But with a counter-based RNG, the state is effectively just the seed and the current counter value, so 2+4 64-bit UInts in the case of Philox4x. And one can partition the counter space between the threads, so everything can be initialized from one common seed. The threads then just start at different points in the vast (4x 64-bit) counter space (derived from the block/thread/…-id, so that scheduling order won’t matter).
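
To make the partitioning idea concrete, here is a toy pure-Julia sketch; the mixing function is a simple 64-bit stand-in (not actual Philox), purely to illustrate how one shared seed plus disjoint counter ranges give each thread its own stream:

```julia
# toy stand-in for a counter-based generator: a 64-bit finalizer-style mixer
function mix(seed::UInt64, counter::UInt64)
    x = seed ⊻ counter
    x ⊻= x >> 33; x *= 0xff51afd7ed558ccd
    x ⊻= x >> 33; x *= 0xc4ceb9fe1a85ec53
    x ⊻= x >> 33
    return x
end

# thread `tid` owns counters tid*2^32 + (0, 1, 2, ...): the streams never overlap
# (for fewer than 2^32 draws per thread), yet everything derives from one seed
nth_value(seed::UInt64, tid::Integer, n::Integer) =
    mix(seed, (UInt64(tid) << 32) + UInt64(n))

seed = 0x1234567890abcdef
[nth_value(seed, tid, n) for tid in 0:3, n in 0:2]
```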

That’s still too much state for each thread (if you store it globally you get synchronizing memory accesses, and with block-shared memory we can’t afford to store that much). But since it’s counter-based, the thread ID within the block could be used to skip ahead and only store a single thread’s worth of state, right? So it should be possible to use this; I’ll have a look!

But since it’s counter-based, the thread ID within the block could be used to skip ahead and only store a single thread’s worth of state, right?

Yes, indeed: knowing only the seed, the thread’s ID, and the number of random numbers it has produced already (i.e. a thread-specific counter), the state of the counter-based RNG (e.g. Philox) can be reconstructed on the fly. From what I’ve seen in the commit you linked, this is similar to what you do now.

Philox and friends are well tested (there’s a nice overview in https://arxiv.org/pdf/1204.6193.pdf) and provide very high-quality random numbers (they actually perform better than the Mersenne Twister in some testing categories).


Very excited to see that one can call rand inside a CUDA kernel now, thanks for the great work!

I have been using Philox in cuRAND to do large-scale Brownian dynamics simulations. If we could have similar RNGs in CUDA.jl, I would be able to do everything entirely in Julia.
