Hi all,
We’re releasing new versions of some crucial packages that make up the Julia CUDA GPU stack today: CUDAdrv.jl 2.0, CUDAnative.jl 2.0 and CuArrays.jl 1.0 (following semver rules for breaking releases). The main addition is a GPU-specific pointer type to prevent erroneous conversions, but that’s a breaking change so please read along.
GPU-specific pointers
CUDAdrv now provides a new pointer type, `CuPtr`, that is ABI-compatible with `Ptr` but refuses to convert to and from these CPU pointers (as that would likely crash Julia or CUDA somewhere down the road). This should avoid, or at least detect, issues like "Calling a gpu model with a cpu array crashes julia" (Issue #581, FluxML/Flux.jl). The new pointer type is used for buffers that are allocated by CUDAdrv. If you have low-level GPU code that, e.g., calls precompiled kernels, you will need to adjust your `cudacall` signatures to use `CuPtr` instead of regular `Ptr`.
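To make the idea concrete, here is a minimal CPU-only sketch of a pointer type that has the same size as `Ptr` but refuses conversion. It is not CUDAdrv's actual `CuPtr` definition: the `DevPtr` type and the second error message are hypothetical and chosen only for illustration.

```julia
# Hypothetical stand-in for a device pointer; NOT CUDAdrv's actual CuPtr definition.
struct DevPtr{T}
    addr::UInt   # a single machine word, so the struct has the same size as Ptr{T}
end

# Refuse to turn a device pointer into a host pointer.
Base.convert(::Type{Ptr{T}}, p::DevPtr) where {T} =
    throw(ArgumentError("cannot convert a GPU pointer to a CPU pointer"))

# Refuse the opposite direction as well (message invented for this sketch).
Base.convert(::Type{DevPtr{T}}, p::Ptr) where {T} =
    throw(ArgumentError("cannot convert a CPU pointer to a GPU pointer"))

d = DevPtr{Float32}(0x1000)

# Attempting the conversion raises an error instead of silently succeeding.
err = try
    convert(Ptr{Float32}, d)
catch e
    e
end
println(err.msg)
```

Because the conversion fails at the call boundary, passing the wrong kind of pointer surfaces as a Julia error rather than as a crash later on.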
CuArrays.jl has already been adapted to use these pointers. The changes are pretty mechanical; see for example the commit "Adapt CUBLAS." (JuliaGPU/CuArrays.jl@4652c11). However, test coverage of CuArrays is far from perfect, so it's possible we've made mistakes without noticing. If your code breaks with errors like "cannot convert a GPU pointer to a CPU pointer", please file an issue.
For ambiguous APIs that support both CPU and GPU pointers (such as parts of CUBLAS), there's also a `PtrOrCuPtr` type.
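As a rough sketch of how such a combined type can be used in method signatures, the snippet below builds a union over a host pointer and a hypothetical device pointer. `DevPtr`, `PtrOrDevPtr`, `location` and `describe` are made-up names for illustration and not part of CUDAdrv.

```julia
# Hypothetical stand-in for a device pointer; NOT CUDAdrv's actual CuPtr.
struct DevPtr{T}
    addr::UInt
end

# Union accepted by APIs that can operate on either host or device memory.
const PtrOrDevPtr{T} = Union{Ptr{T}, DevPtr{T}}

# Hypothetical helper: report where the memory lives.
location(::Ptr) = :host
location(::DevPtr) = :device

# A wrapper whose argument may live on either side, but must point to Float32.
describe(p::PtrOrDevPtr{Float32}) = location(p)

println(describe(Ptr{Float32}(0)))
println(describe(DevPtr{Float32}(0)))
```

The element type is still checked, so a `Ptr{Int}` would not match the `describe` method above.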
Pooling allocator performance improvements
We now rely less on running full collections of Julia's garbage collector, which should improve performance for some workloads.
However, the underlying problem remains: we don't know when GPU buffers are available for reuse until the Julia GC (which doesn't know about the GPU's memory pressure) has kicked in. To work around this issue, you can now explicitly call `CuArrays.unsafe_free!` on a CuArray, which marks it as available for reuse by the pooling allocator.
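The effect of an explicit free on a pooling allocator can be illustrated with a toy, CPU-only pool. This is a sketch under simplifying assumptions, not CuArrays' allocator: `ToyPool`, `pool_alloc!` and `pool_free!` are hypothetical names, and plain byte vectors stand in for GPU buffers.

```julia
# Toy pooling allocator; hypothetical names, CPU memory standing in for GPU buffers.
mutable struct ToyPool
    free::Dict{Int,Vector{Vector{UInt8}}}   # byte size => buffers ready for reuse
    fresh::Int                              # number of fresh (non-reused) allocations
end
ToyPool() = ToyPool(Dict{Int,Vector{Vector{UInt8}}}(), 0)

function pool_alloc!(pool::ToyPool, nbytes::Int)
    bufs = get(pool.free, nbytes, nothing)
    if bufs !== nothing && !isempty(bufs)
        return pop!(bufs)                   # reuse a buffer that was released earlier
    end
    pool.fresh += 1                         # stands in for an expensive device allocation
    return Vector{UInt8}(undef, nbytes)
end

function pool_free!(pool::ToyPool, buf::Vector{UInt8})
    # Mark the buffer as reusable right away, without waiting for the GC.
    push!(get!(pool.free, length(buf), Vector{Vector{UInt8}}()), buf)
    return nothing
end

pool = ToyPool()
a = pool_alloc!(pool, 1024)
pool_free!(pool, a)            # explicit release
b = pool_alloc!(pool, 1024)    # served from the pool, no fresh allocation
println(pool.fresh)
```

As the `unsafe_` prefix of the real function suggests, the caller is responsible for not using an array after it has been released.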
If you have a test case that performs badly, you can now have CuArrays print GC timings by calling `CuArrays.pool_timings()` (building on @kristoffer.carlsson's excellent TimerOutputs.jl):
$ julia -L application.jl -e "CuArrays.pool_timings()"
────────────────────────────────────────────────
                                   Time
                          ──────────────────────
    Tot / % measured:         12.1s / 0.40%
Section           ncalls     time   %tot     avg
────────────────────────────────────────────────
pooled alloc           3   48.5ms   100%  16.2ms
1 try alloc            3   48.5ms   100%  16.2ms
background task        1   7.66ms  15.8%  7.66ms
scan                   1   1.53μs  0.00%  1.53μs
reclaim                1   1.44μs  0.00%  1.44μs
────────────────────────────────────────────────
This information should be useful to further optimize the allocator.
Minor improvements
CuArrays.jl:
- initial support for logical indexing, `filter` and `accumulate` by @dpsanders
- initial low-level wrappers for cublasxt by @kslimes
CUDAnative.jl:
- fixes for use of StaticArrays.jl within kernels (requires Julia master/1.2)
- support for `threadfence`, `clock` and `nanosleep` by @vchuravy
- support for predicated synchronization (`sync_threads_count`, `sync_threads_and`, `sync_threads_or`) by @qin-yu
- support for the CUDA device runtime by @vchuravy (paving the way for dynamic parallelism!)