Release: CUDAdrv/CUDAnative 2.0, CuArrays 1.0

maleadt · March 22, 2019, 2:36pm

Hi all,

Today we’re releasing new versions of several crucial packages in the Julia CUDA GPU stack: CUDAdrv.jl 2.0, CUDAnative.jl 2.0 and CuArrays.jl 1.0 (following semver rules for breaking releases). The main addition is a GPU-specific pointer type that prevents erroneous conversions. This is a breaking change, so please read on.

GPU-specific pointers

CUDAdrv now provides a new pointer type, CuPtr, that is ABI-compatible with Ptr but refuses to convert to and from these CPU pointers (as that would likely crash Julia or CUDA somewhere down the road). This should avoid, or at least detect, issues like FluxML/Flux.jl issue #581 ("Calling a gpu model with a cpu array crashes julia"). The new pointer type is used for buffers that are allocated by CUDAdrv. If you have low-level GPU code that, e.g., calls precompiled kernels, you will need to adjust your cudacall signatures to use CuPtr instead of regular Ptr.
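To make the idea concrete, here is a minimal pure-Julia sketch of a pointer wrapper that refuses to convert to and from Ptr. DevicePtr and host_only are hypothetical names made up for illustration; this is not how CUDAdrv defines CuPtr, and the snippet needs no GPU to run.

```julia
# Hypothetical stand-in for a GPU pointer type; NOT the CUDAdrv implementation.
# It stores a raw address, like Ptr does, but is a distinct type.
struct DevicePtr{T}
    addr::UInt
end

# Refuse conversions between host and device pointers, in both directions.
Base.convert(::Type{Ptr{T}}, ::DevicePtr) where {T} =
    throw(ArgumentError("cannot convert a GPU pointer to a CPU pointer"))
Base.convert(::Type{DevicePtr{T}}, ::Ptr) where {T} =
    throw(ArgumentError("cannot convert a CPU pointer to a GPU pointer"))

# A function that only makes sense for host memory (hypothetical).
host_only(p::Ptr{Float32}) = "host pointer accepted"

gpu_ptr = DevicePtr{Float32}(UInt(0x1000))  # made-up address, never dereferenced

# An explicit conversion hits the throwing convert method above.
converted = try
    convert(Ptr{Float32}, gpu_ptr)
catch err
    err
end

# Passing the wrong pointer kind to a typed signature fails at dispatch.
dispatched = try
    host_only(gpu_ptr)
catch err
    err
end
```

In this sketch both misuse paths produce an ordinary Julia exception at the call site, which is the kind of early failure the new pointer type is meant to provide.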

CuArrays.jl has already been adapted to use these pointers. The changes are pretty mechanical; see for example the commit "Adapt CUBLAS." (JuliaGPU/CuArrays.jl@4652c11). However, test coverage of CuArrays is far from perfect, so it’s possible we’ve made mistakes without noticing. If your code breaks with errors like "cannot convert a GPU pointer to a CPU pointer", please file an issue.

For ambiguous APIs that support both CPU and GPU pointers (such as parts of CUBLAS), there’s also a PtrOrCuPtr type.
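As a rough illustration of what such an either-kind pointer type buys you, the sketch below uses a Union of two pointer types. All names here (DevicePtr, HostOrDevicePtr, location, scalar_location) are hypothetical, and a Union is only one possible way to model this; it is not claimed to be how PtrOrCuPtr is defined.

```julia
# Hypothetical stand-in for a GPU pointer type; NOT the CUDAdrv implementation.
struct DevicePtr{T}
    addr::UInt
end

# Hypothetical analogue of PtrOrCuPtr: accepts either kind of pointer.
const HostOrDevicePtr{T} = Union{Ptr{T}, DevicePtr{T}}

# Small helpers that report where a pointer lives.
location(::Ptr) = :host
location(::DevicePtr) = :device

# A wrapper whose scalar argument may live in host or device memory,
# loosely modelled on library routines that accept both.
scalar_location(alpha::HostOrDevicePtr{Float32}) = location(alpha)

# Made-up addresses; the pointers are only inspected, never dereferenced.
on_host = scalar_location(Ptr{Float32}(UInt(0x1000)))
on_device = scalar_location(DevicePtr{Float32}(UInt(0x2000)))
```

The signature still rejects pointers with the wrong element type, so only the host/device distinction is relaxed.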

Pooling allocator performance improvements

We now rely less on running a full Julia garbage collection, which should improve performance for some workloads.

However, the underlying problem remains: we don’t know when GPU buffers are available for reuse until the Julia GC (which doesn’t know about the GPU’s memory pressure) has kicked in. To work around this issue, you can now explicitly call CuArrays.unsafe_free! on a CuArray, which marks its buffer as available for reuse by the pooling allocator. As the name suggests, the array should not be used afterwards.
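The effect of an explicit free can be seen in a toy model. The sketch below is a hypothetical, CPU-only pool (ToyPool, alloc! and release! are made-up names, and byte vectors stand in for GPU buffers); it is not the CuArrays allocator, only an illustration of why returning buffers early avoids fresh allocations.

```julia
# Toy model of a pooling allocator; hypothetical, NOT the CuArrays implementation.
# Buffers are plain byte vectors standing in for GPU memory.
mutable struct ToyPool
    free::Dict{Int, Vector{Vector{UInt8}}}  # size in bytes => reusable buffers
    fresh_allocations::Int                  # how often we had to allocate anew
end
ToyPool() = ToyPool(Dict{Int, Vector{Vector{UInt8}}}(), 0)

# Hand out a cached buffer of the right size if one exists, else allocate.
function alloc!(pool::ToyPool, bytes::Int)
    cached = get(pool.free, bytes, nothing)
    if cached !== nothing && !isempty(cached)
        return pop!(cached)
    end
    pool.fresh_allocations += 1
    return Vector{UInt8}(undef, bytes)
end

# Explicitly return a buffer to the pool, playing the role of unsafe_free!.
# The caller must not touch `buf` afterwards.
function release!(pool::ToyPool, buf::Vector{UInt8})
    push!(get!(() -> Vector{Vector{UInt8}}(), pool.free, length(buf)), buf)
    return nothing
end

pool = ToyPool()
for _ in 1:3
    buf = alloc!(pool, 1024)
    # ... use buf ...
    release!(pool, buf)  # without this, every iteration would allocate anew
end
```

With CuArrays itself, the corresponding step is calling CuArrays.unsafe_free! on a CuArray once you are done with it, for example at the end of each iteration of a hot loop.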

If you have a test case that performs badly, you can now have CuArrays print GC timings by calling CuArrays.pool_timings() (building on @kristoffer.carlsson’s excellent TimerOutputs.jl):

$ julia -L application.jl -e "CuArrays.pool_timings()"
 ──────────────────────────────────────────────
                                 Time          
                         ──────────────────────
    Tot / % measured:         12.1s / 0.40%    

 Section         ncalls     time   %tot     avg
 ──────────────────────────────────────────────
 pooled alloc         3   48.5ms   100%  16.2ms
   1 try alloc        3   48.5ms   100%  16.2ms
 background task      1   7.66ms  15.8%  7.66ms
   scan               1   1.53μs  0.00%  1.53μs
   reclaim            1   1.44μs  0.00%  1.44μs
 ──────────────────────────────────────────────

This information should be useful to further optimize the allocator.

Minor improvements

CuArrays.jl:

initial support for logical indexing, filter and accumulate by @dpsanders
initial low-level wrappers for cublasxt by @kslimes

CUDAnative.jl:

fixes for use of StaticArrays.jl within kernels (requires Julia master/1.2)
support for threadfence, clock and nanosleep by @vchuravy
support for predicated synchronization (sync_threads_count, sync_threads_and, sync_threads_or) by @qin-yu
support for the CUDA device runtime by @vchuravy (paving the way for dynamic parallelism!)
