CUDA.jl v3.0

No, it’s hand-rolled (hence the quality issues) because we want a fully generic fallback.

We do have a pure-Julia implementation of Philox in Random123.jl. Could we run a modified version of that, maybe? Then we could also offer the same RNG on both GPU and CPU. It’s a counter-based RNG, so it’s fully parallel.
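To illustrate what "counter-based, so fully parallel" means, here is a rough, self-contained sketch of a Philox-2x32-style generator in plain Julia. It does not use the Random123.jl API or anything from CUDA.jl. The function names (`philox2x32`, `uniform_at`, `fill_uniform!`) are made up for this example, and the constants and round structure are written from my recollection of the Philox scheme, so treat it as a sketch rather than a validated implementation.

```julia
# Multiplier and Weyl key increment as I recall them for Philox-2x32 (assumption).
const PHILOX_M = 0xd256d193   # UInt32 literal
const PHILOX_W = 0x9e3779b9   # UInt32 literal

# One round: 32x32 -> 64-bit multiply, then mix the halves with the key.
@inline function philox_round(c0::UInt32, c1::UInt32, k::UInt32)
    prod = UInt64(PHILOX_M) * UInt64(c0)
    hi = (prod >> 32) % UInt32
    lo = prod % UInt32
    return hi ⊻ k ⊻ c1, lo
end

# Stateless block function: the output depends only on (counter, key).
function philox2x32(c0::UInt32, c1::UInt32, key::UInt32; rounds::Int = 10)
    for r in 1:rounds
        r > 1 && (key += PHILOX_W)   # UInt32 addition wraps around
        c0, c1 = philox_round(c0, c1, key)
    end
    return c0, c1
end

# Hypothetical helper: uniform Float32 in [0, 1) for element index i.
function uniform_at(i::Integer, seed::UInt32)
    r, _ = philox2x32(i % UInt32, UInt32(0), seed)
    return ldexp(Float32(r >> 8), -24)   # keep the top 24 bits
end

# Every element is computed independently of all the others, so this loop
# could just as well be one GPU thread per index.
function fill_uniform!(out::AbstractVector{Float32}, seed::UInt32)
    for i in eachindex(out)
        out[i] = uniform_at(i, seed)
    end
    return out
end
```

Since there is no shared mutable state, the value for index `i` is the same no matter which thread computes it or in what order, which is also what would make it possible to get identical streams on CPU and GPU.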

Update: Fixed wrong link to Random123.jl