We’re releasing new versions of some crucial packages that make up the Julia CUDA GPU stack today: CUDAdrv.jl 2.0, CUDAnative.jl 2.0 and CuArrays.jl 1.0 (following semver rules for breaking releases). The main addition is a GPU-specific pointer type to prevent erroneous conversions, but that’s a breaking change so please read along.
CUDAdrv now provides a new pointer type, CuPtr, that is ABI-compatible with Ptr but refuses to convert to and from these CPU pointers (as that would likely crash Julia or CUDA somewhere down the road). This should avoid, or at least detect, issues like FluxML/Flux.jl issue #581 ("Calling a gpu model with a cpu array crashes julia"). The new pointer type is used for buffers that are allocated by CUDAdrv. If you have low-level GPU code that, e.g., calls precompiled kernels, you will need to adjust your cudacall signatures to use CuPtr instead of regular CPU pointers.
CuArrays.jl has already been adapted to use these pointers. The changes are pretty mechanical; see for example commit 4652c11 in JuliaGPU/CuArrays.jl ("Adapt CUBLAS."). However, test coverage of CuArrays is far from perfect, so it’s possible we’ve made mistakes without noticing. If your code breaks with errors like "cannot convert a GPU pointer to a CPU pointer", please file an issue.
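To make the idea concrete, here is a minimal pure-Julia sketch of a separate pointer type that has the same size as Ptr but refuses to convert to or from it. Everything in it (DevPtr, launch_kernel, the addresses) is a hypothetical stand-in for illustration and is not how CUDAdrv actually defines CuPtr; it runs without a GPU.

```julia
# Hypothetical stand-in for a device pointer; NOT CUDAdrv's actual CuPtr definition.
# A single UInt field gives it the same size as a regular Ptr.
struct DevPtr{T}
    addr::UInt
end

# Refuse implicit conversions in both directions, with a descriptive error.
Base.convert(::Type{Ptr{T}}, p::DevPtr) where {T} =
    throw(ArgumentError("cannot convert a GPU pointer to a CPU pointer"))
Base.convert(::Type{DevPtr{T}}, p::Ptr) where {T} =
    throw(ArgumentError("cannot convert a CPU pointer to a GPU pointer"))

# A hypothetical low-level entry point that only accepts device pointers,
# similar in spirit to declaring CuPtr in a cudacall signature.
launch_kernel(buf::DevPtr{Float32}) = "launched with device buffer at $(buf.addr)"

gpu_buf = DevPtr{Float32}(0x1000)        # pretend device address
cpu_buf = Ptr{Float32}(UInt(0x2000))     # pretend host address

println(launch_kernel(gpu_buf))          # dispatch succeeds for the device pointer

# Passing the host pointer is rejected by dispatch instead of crashing later.
println(applicable(launch_kernel, cpu_buf))

# Explicit conversion attempts raise the descriptive error.
try
    convert(Ptr{Float32}, gpu_buf)
catch err
    println(err.msg)
end
```

The point of the design is that a mix-up between host and device memory surfaces as a Julia error at the call site, rather than as a crash inside the driver.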
For ambiguous APIs that support both CPU and GPU pointers (such as parts of CUBLAS), there’s also a
Pooling allocator performance improvements
We now rely less on running Julia’s full garbage collector, which should improve performance for some workloads.
However, the underlying problem remains: we don’t know when GPU buffers are available for reuse until the Julia GC (which doesn’t know about the GPU’s memory pressure) has kicked in. To work around this issue, you can now do an explicit CuArrays.unsafe_free! of a CuArray, which marks it as available for reuse by the pooling allocator.
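As a rough illustration of why an explicit free helps, here is a toy pooling allocator in plain Julia. The names ToyPool, pool_alloc! and pool_free! are hypothetical and this is not the CuArrays implementation; host byte vectors stand in for GPU buffers so the sketch runs without a GPU.

```julia
# Toy pooling allocator; hypothetical names, NOT CuArrays' implementation.
mutable struct ToyPool
    available::Dict{Int,Vector{Vector{UInt8}}}  # released buffers, keyed by size in bytes
    fresh_allocs::Int                            # number of "expensive" allocations made
end
ToyPool() = ToyPool(Dict{Int,Vector{Vector{UInt8}}}(), 0)

# Hand out a released buffer of the right size if one exists,
# otherwise fall back to a fresh (costly) allocation.
function pool_alloc!(pool::ToyPool, nbytes::Int)
    bufs = get(pool.available, nbytes, nothing)
    if bufs !== nothing && !isempty(bufs)
        return pop!(bufs)
    end
    pool.fresh_allocs += 1   # stands in for a real device allocation
    return Vector{UInt8}(undef, nbytes)
end

# Explicitly mark a buffer as reusable, without waiting for the GC to notice
# that it is no longer referenced.
function pool_free!(pool::ToyPool, buf::Vector{UInt8})
    push!(get!(pool.available, length(buf), Vector{Vector{UInt8}}()), buf)
    return nothing
end

pool = ToyPool()
a = pool_alloc!(pool, 1024)
pool_free!(pool, a)            # early release, in the spirit of unsafe_free!
b = pool_alloc!(pool, 1024)    # served from the pool, no fresh allocation
println(pool.fresh_allocs)
```

As the "unsafe" in the name suggests, the caller must not touch the original array after releasing it, since the same memory may already have been handed out again.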
If you have a test case that performs badly, you can now have CuArrays print GC timings by calling CuArrays.pool_timings() (building on @kristoffer.carlsson’s excellent TimerOutputs.jl):
$ julia -L application.jl -e "CuArrays.pool_timings()"
 ──────────────────────────────────────────────
                                  Time
                          ──────────────────────
     Tot / % measured:        12.1s / 0.40%

 Section          ncalls     time   %tot     avg
 ──────────────────────────────────────────────
 pooled alloc          3   48.5ms   100%  16.2ms
 1 try alloc           3   48.5ms   100%  16.2ms
 background task       1   7.66ms  15.8%  7.66ms
 scan                  1   1.53μs  0.00%  1.53μs
 reclaim               1   1.44μs  0.00%  1.44μs
 ──────────────────────────────────────────────
This information should be useful to further optimize the allocator.
- initial support for logical indexing,
- initial low-level wrappers for cublasxt by @kslimes
- fixes for use of StaticArrays.jl within kernels (requires Julia master/1.2)
- support for
- support for predicated synchronization (sync_threads_or) by @qin-yu
- support for the CUDA device runtime by @vchuravy (paving the way for dynamic parallelism!)