CUDA.jl v3.0

maleadt · April 9, 2021, 2:34pm

Hi all,

I’ve just released CUDA.jl v3.0, a slightly-breaking release with a lot of new features. You can read all about it on the JuliaGPU blog: CUDA.jl 3.0 ⋅ JuliaGPU.

A summary of the new features:

- task-based concurrency: it is now possible to perform independent operations (or use different devices) on different Julia tasks, and expect the execution of those tasks to overlap.
- memory allocator improvements: on recent NVIDIA drivers supporting CUDA 11.2, we now use the CUDA stream-ordered allocator instead of caching memory. This reduces memory pressure.
- device overrides: it is no longer required to use CUDA-specific versions of incompatible methods like Base.sin; instead, GPUCompiler.jl can override these automatically.
- device-side random numbers: you can now call rand() in a kernel. The device-side RNG is pretty fast, but the quality of the generated numbers is subpar (help is wanted here).
- revamped CUDNN interface: the library wrappers have been completely reworked to make it easier to use advanced features. The high-level wrappers have been moved to the NNlib.jl repo.
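
To make the first and fourth items concrete, here is a minimal sketch that calls rand() inside a kernel from two Julia tasks. It assumes a working GPU and CUDA.jl 3.0 or later; the names fill_rand! and run_on_tasks are made up for this illustration, and whether the two tasks actually overlap depends on the hardware and the workload.

```julia
using CUDA

# Kernel: each thread writes one number from the device-side RNG.
function fill_rand!(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)
        @inbounds a[i] = rand(Float32)
    end
    return nothing
end

# Launch the kernel from two independent Julia tasks. Each task works on
# its own array, so the two launches are free to overlap.
function run_on_tasks(n = 1024)
    results = Vector{Vector{Float32}}(undef, 2)
    @sync for t in 1:2
        @async begin
            a = CUDA.zeros(Float32, n)
            @cuda threads=256 blocks=cld(n, 256) fill_rand!(a)
            results[t] = Array(a)  # copying to the host waits for the kernel
        end
    end
    return results
end
```

Calling run_on_tasks() returns two host vectors of random Float32 values, one per task.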

pjentsch0 · April 9, 2021, 3:23pm

hooray! thank you!

oschulz · April 9, 2021, 3:25pm

Which RNG algorithm is that currently? Something from cuRAND? I think there’s Philox4x in there, and that one has a good reputation.

maleadt · April 9, 2021, 3:27pm

No, it’s hand-rolled (hence the quality issues) because we want a fully-generic fallback. Furthermore, we can’t easily use the CURAND device API since it’s exposed via C++ headers.

jonathan-laurent · April 9, 2021, 3:38pm

Thanks for a great release! The memory allocator improvements result in a 20-30% speedup for AlphaZero.jl on the Connect Four benchmark.

oschulz · April 9, 2021, 3:47pm

> No, it’s hand-rolled (hence the quality issues) because we want a fully-generic fallback.

We do have a pure-Julia implementation of Philox in Random123.jl. Could we run (a modified version of) that, maybe? Then we could also offer the same RNG on GPU and CPU. It’s a counter-based RNG, so fully parallel.

Update: Fixed wrong link to Random123.jl

maleadt · April 9, 2021, 3:55pm

The tricky part though is that not every thread can afford to have its own RNG state, and if you put it in block-shared memory you need to take care that the order in which threads are scheduled within a block does not affect the numbers that are generated (from a single thread’s point of view, that is). See https://github.com/JuliaGPU/CUDA.jl/blob/27520e60d65e08e50e8c6ce30b2ec322d0fdecb8/src/device/random.jl#L71-L77, https://github.com/JuliaGPU/CUDA.jl/blob/27520e60d65e08e50e8c6ce30b2ec322d0fdecb8/src/device/random.jl#L161-L224.

oschulz · April 9, 2021, 4:59pm

> The tricky part though is that not every thread can afford to have its own RNG state

But with a counter-based RNG, the state is effectively just the seed and the current counter value, so 2+4 64-bit UInts in the case of Philox4x. And one can partition the counter space between the threads, so everything can be initialized from one common seed. The threads then just start at different points in the vast (4x 64-bit) counter space (by block/thread/…-id, so that scheduling order won’t matter).
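
A rough CPU-only sketch of that partitioning idea, in plain Julia. This is not Philox: mix64 is the SplitMix64 finalizer, used here purely as a stand-in mixing function with no claim about statistical quality, and the 32/32-bit split of a single 64-bit counter is an assumption made to keep the example short. The names draw and simulate are hypothetical.

```julia
# Stand-in mixer (SplitMix64 finalizer). A real implementation would run
# Philox rounds here instead.
function mix64(x::UInt64)
    x = (x ⊻ (x >> 30)) * 0xbf58476d1ce4e5b9
    x = (x ⊻ (x >> 27)) * 0x94d049bb133111eb
    return x ⊻ (x >> 31)
end

# Stateless draw: the thread id selects the upper half of the counter and
# the per-thread draw index fills the lower half, so no two threads ever
# use the same counter value.
function draw(seed::UInt64, thread_id::Integer, n::Integer)
    @assert 0 <= n < 2^32 && 0 <= thread_id < 2^32
    counter = (UInt64(thread_id) << 32) | UInt64(n)
    return mix64(mix64(counter) ⊻ seed)
end

# Simulate threads drawing `ndraws` numbers each, visited in `order`.
function simulate(seed::UInt64, order, ndraws)
    out = Dict{Int,Vector{UInt64}}()
    for tid in order
        out[tid] = [draw(seed, tid, k) for k in 0:ndraws-1]
    end
    return out
end

seed = 0x0123456789abcdef
forward  = simulate(seed, 1:8, 4)
backward = simulate(seed, 8:-1:1, 4)
@assert forward == backward  # visiting order does not change any thread's numbers
```

Because each output depends only on the seed, the thread id and the draw index, the order in which threads run cannot change what any single thread sees.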

maleadt · April 9, 2021, 5:12pm

That’s still too much state for each thread (if you store it globally you get synchronizing memory accesses, and with block-shared memory we can’t afford to store that much). But since it’s counter-based, the thread ID within the block could be used to skip ahead, so that only a single thread’s worth of state needs to be stored, right? So then it should be possible to use this, I’ll have a look!

oschulz · April 9, 2021, 5:38pm

> But since it’s counter based the thread ID within the block could be used to skip-ahead and only store a single thread’s worth of state, right?

Yes, indeed: knowing only the seed, its ID and the number of random numbers it has produced already (so a thread-specific counter), a thread could produce the state of the counter-based RNG (e.g. Philox) on the fly. From what I’ve seen in the commit you linked, this is similar to what you do now.

Philox & friends are well tested (nice overview in https://arxiv.org/pdf/1204.6193.pdf) and provide very high quality random numbers (they actually perform better than Mersenne Twister in some testing categories).
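
The skip-ahead idea from the last few posts could look roughly like the following CPU simulation. It is only a sketch under strong assumptions: BlockState, thread_draw and advance! are hypothetical names, mix64 (the SplitMix64 finalizer) stands in for Philox, and every thread of the block is assumed to draw exactly once per round, which a real kernel cannot rely on.

```julia
# Stand-in mixer (SplitMix64 finalizer), not Philox.
function mix64(x::UInt64)
    x = (x ⊻ (x >> 30)) * 0xbf58476d1ce4e5b9
    x = (x ⊻ (x >> 27)) * 0x94d049bb133111eb
    return x ⊻ (x >> 31)
end

# What a block would keep in shared memory: one seed and one counter,
# regardless of how many threads the block has.
mutable struct BlockState
    seed::UInt64
    base::UInt64  # counter value at the start of the block's current round
end

# Thread `tid` (1-based) skips ahead from the shared counter by its own
# index. Nothing is stored per thread.
thread_draw(s::BlockState, tid::Integer) =
    mix64(mix64(s.base + UInt64(tid - 1)) ⊻ s.seed)

# Once all threads have drawn, the block moves the shared counter past
# the values that were just used.
function advance!(s::BlockState, nthreads::Integer)
    s.base += UInt64(nthreads)
    return s
end

# Two rounds for a block of 32 threads.
state = BlockState(0x00000000deadbeef, UInt64(0))
round1 = [thread_draw(state, tid) for tid in 1:32]
advance!(state, 32)
round2 = [thread_draw(state, tid) for tid in 1:32]
@assert length(unique(vcat(round1, round2))) == 64  # no counter value is reused
```

The shared state stays at two 64-bit words per block here. The hard part raised earlier in the thread, threads of one block drawing different numbers of times, is deliberately left out.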

zhiwei · April 16, 2021, 10:26pm

Very excited to see that one can call rand inside a CUDA kernel now, thanks for the great work!

I have been using Philox in cuRAND to do large-scale Brownian dynamics simulations. If we could have similar RNGs in CUDA.jl, I would be able to do everything entirely in Julia.
