Release: CUDAdrv/CUDAnative 2.0, CuArrays 1.0

Hi all,

Today we’re releasing new versions of the core packages that make up the Julia CUDA GPU stack: CUDAdrv.jl 2.0, CUDAnative.jl 2.0 and CuArrays.jl 1.0 (major version bumps, following semver rules for breaking releases). The main addition is a GPU-specific pointer type that prevents erroneous conversions, but since that is a breaking change, please read on.

GPU-specific pointers

CUDAdrv now provides a new pointer type, CuPtr, that is ABI-compatible with Ptr but refuses to convert to or from CPU pointers (as such a conversion would likely crash Julia or CUDA somewhere down the road). This should avoid, or at least detect, issues like Calling a gpu model with a cpu array crashes julia · Issue #581 · FluxML/Flux.jl · GitHub. The new pointer type is used for buffers allocated by CUDAdrv. If you have low-level GPU code that, e.g., calls precompiled kernels, you will need to adjust your cudacall signatures to use CuPtr instead of regular Ptr.
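
For example, a minimal sketch of what such an adjustment looks like (the PTX file, kernel name, device buffers d_a/d_b/d_c and launch size N here are hypothetical placeholders, not part of the release):

using CUDAdrv

md   = CuModuleFile("vadd.ptx")        # hypothetical precompiled module
vadd = CuFunction(md, "kernel_vadd")   # hypothetical kernel taking three pointers

# before (CUDAdrv 1.x): plain CPU pointer types in the signature
# cudacall(vadd, Tuple{Ptr{Cfloat}, Ptr{Cfloat}, Ptr{Cfloat}}, d_a, d_b, d_c; threads=N)

# after (CUDAdrv 2.0): GPU-specific pointer types
cudacall(vadd, Tuple{CuPtr{Cfloat}, CuPtr{Cfloat}, CuPtr{Cfloat}}, d_a, d_b, d_c; threads=N)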

CuArrays.jl has already been adapted to use these pointers. The changes are pretty mechanical; see for example Adapt CUBLAS. · JuliaGPU/CuArrays.jl@4652c11 · GitHub. However, test coverage of CuArrays is far from perfect, so it’s possible we’ve made mistakes without noticing. If your code breaks with errors like cannot convert a GPU pointer to a CPU pointer, please file an issue.

For ambiguous APIs that support both CPU and GPU pointers (such as parts of CUBLAS), there’s also a PtrOrCuPtr type.
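
As a rough sketch of how that is used in a wrapper, loosely modeled on the CUBLAS situation: cublasSscal’s alpha argument may live on the host or on the device (depending on the pointer mode), so it can be declared as PtrOrCuPtr. The libcublas, cublasStatus_t, cublasHandle_t and handle() names below are assumptions for illustration, not the exact CuArrays internals, and automatic conversion of the CuArray argument to a CuPtr is assumed:

function scal!(n::Integer, alpha, x::CuArray{Float32})
    # alpha may be a host or a device pointer, hence PtrOrCuPtr{Cfloat};
    # the array data must live on the device, hence CuPtr{Cfloat}
    ccall((:cublasSscal_v2, libcublas), cublasStatus_t,
          (cublasHandle_t, Cint, PtrOrCuPtr{Cfloat}, CuPtr{Cfloat}, Cint),
          handle(), n, alpha, x, 1)
    return x
end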

Pooling allocator performance improvements

The pooling allocator now relies less on running Julia’s full garbage collector, which should improve performance for some workloads.

However, the underlying problem remains: we don’t know when GPU buffers are available for reuse until the Julia GC (which doesn’t know about the GPU’s memory pressure) has kicked in. To work around this issue, you can now call CuArrays.unsafe_free! on a CuArray explicitly, which marks its memory as available for reuse by the pooling allocator.
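
For example (a sketch; the sizes and the intermediate are arbitrary):

using CuArrays

a = cu(rand(Float32, 1024, 1024))
b = a * a                      # temporary intermediate
s = sum(b)
CuArrays.unsafe_free!(b)       # hand b's buffer back to the pool right away,
                               # instead of waiting for the Julia GC to collect b;
                               # b must not be used after this point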

If you have a test case that performs badly, you can now have CuArrays print pool timings by calling CuArrays.pool_timings() (building on @kristoffer.carlsson’s excellent TimerOutputs.jl):

$ julia -L application.jl -e "CuArrays.pool_timings()"
 ──────────────────────────────────────────────
                                 Time          
                         ──────────────────────
    Tot / % measured:         12.1s / 0.40%    

 Section         ncalls     time   %tot     avg
 ──────────────────────────────────────────────
 pooled alloc         3   48.5ms   100%  16.2ms
   1 try alloc        3   48.5ms   100%  16.2ms
 background task      1   7.66ms  15.8%  7.66ms
   scan               1   1.53μs  0.00%  1.53μs
   reclaim            1   1.44μs  0.00%  1.44μs
 ──────────────────────────────────────────────

This information should be useful to further optimize the allocator.

Minor improvements

CuArrays.jl:

  • initial support for logical indexing, filter and accumulate by @dpsanders (see the sketch after this list)
  • initial low-level wrappers for cublasxt by @kslimes
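
A quick sketch of what the new logical indexing, filter and accumulate support enables (element values and the predicate are just illustrative):

using CuArrays

x = cu(Float32[1, -2, 3, -4, 5])
x[x .> 0]                  # logical indexing
filter(v -> v > 0, x)      # filter
accumulate(+, x)           # accumulate (here: a prefix sum)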

CUDAnative.jl:

  • fixes for use of StaticArrays.jl within kernels (requires Julia master/1.2)
  • support for threadfence, clock and nanosleep by @vchuravy
  • support for predicated synchronization (sync_threads_count, sync_threads_and, sync_threads_or) by @qin-yu (a sketch follows below)
  • support for the CUDA device runtime by @vchuravy (paving the way for dynamic parallelism!)
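
To illustrate the predicated synchronization intrinsics, here’s a rough kernel sketch; it assumes the intrinsic accepts a Bool predicate (mirroring CUDA’s __syncthreads_count), and the names and launch sizes are illustrative only:

using CUDAnative, CuArrays

# Count, per block, how many threads see a positive input element.
function count_positive(out, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    pred = i <= length(x) && x[i] > 0
    n = sync_threads_count(pred)   # barrier + count of threads with a true predicate
    if threadIdx().x == 1
        out[blockIdx().x] = n
    end
    return
end

x   = cu(randn(Float32, 1024))
out = CuArray(zeros(Int32, 4))
@cuda blocks=4 threads=256 count_positive(out, x)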