CUDA.jl 2.0: Per-thread streams, Float16, CUSPARSE clean-up

Hi all,

I’ve just tagged and released CUDA.jl 2.0, with several new features.

This release is slightly breaking because of the following changes:

  • per-thread streams: unlikely to break anything since few people are using threads with CUDA.jl
  • CUSPARSE clean-up: for example, switch2XXX methods are now convert methods
  • array dispatch changes: view/reinterpret/reshape are now represented using Base’s wrappers.

This last point isn’t technically breaking, but it’s likely that some methods that still dispatch on ::CuArray won’t get considered anymore now that, e.g., view(...) = ::SubArray{<:CuArray}. As a result, fall-back Base methods might get used instead of CUDA-specific implementations, triggering scalar iteration or invalid pointer conversions (GPU array to CPU pointer). The fix is to use DenseCuArray (if your method needs a CuPtr), StridedCuArray (for a CuPtr + strides) or AnyCuArray (for anything that can be used in a kernel). Please file issues if you encounter this with array operations from CUDA.jl or GPUArrays.jl.
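To illustrate the fix, here is a minimal sketch (the function names `gpu_fill!` and `gpu_double!` are hypothetical; the union aliases are the ones from CUDA.jl mentioned above):

```julia
using CUDA

# Hypothetical helper that needs a raw device pointer, so it should accept
# any contiguous GPU array, not just ::CuArray itself:
gpu_fill!(A::DenseCuArray{T}, x::T) where {T} = (fill!(A, x); A)

# Hypothetical helper that only needs to work inside a kernel or broadcast,
# so it can accept wrapped GPU arrays too:
gpu_double!(A::AnyCuArray) = (A .*= 2; A)

A = CUDA.zeros(Float32, 4, 4)
gpu_fill!(A, 1f0)

v = view(A, :, 1)   # now a SubArray wrapping a CuArray, not a CuArray
gpu_double!(v)      # still hits the GPU method thanks to AnyCuArray
```

Had `gpu_double!` been defined on `::CuArray`, the call with `v` would have fallen back to a generic Base method and triggered scalar iteration.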

Finally, since this is a breaking release, dependent packages like Flux.jl still need to be updated or have their compatibility bounds bumped, so many users won’t be able to install CUDA.jl 2.0 just yet.


initial support for Float16 [here assuming Julia’s type] CUBLAS wrappers can be used with (B)Float16 inputs […]

julia> using BFloat16s

[…] Alternatively, CUBLAS can be configured to automatically down-cast 32-bit inputs to Float16.
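A sketch of that global mode, assuming the `CUDA.math_mode!` API from the CUDA.jl 2.0 release notes:

```julia
using CUDA

# Opt in to fast math globally: CUBLAS may then down-cast Float32
# inputs to Float16 to use tensor cores for GEMM.
CUDA.math_mode!(CUDA.FAST_MATH)

A = CUDA.rand(Float32, 128, 128)
B = CUDA.rand(Float32, 128, 128)
C = A * B   # may now run with Float16 tensor cores internally
```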

How does this compare to using CUDA from other languages, e.g. C++, or indirectly from Python?

My understanding is that Nvidia provides this new type and makes life easy for C++ programmers, who would otherwise have needed to replace lots of types.

Since you can downcast from Float32, I assume you can also from Float64 (just not directly; maybe not a big worry, as nobody uses it for ANNs anyway)?

Python libraries, e.g. PyTorch, have had some advantage over Julia and pure Julia packages; is this likely to close the gap (once Flux etc. support this updated wrapper)? [You can also use PyTorch.jl/ThArrays.jl and bypass this wrapper. Is that something you would mix and match with the wrapper?]

That’s the global mode mentioned in the blog post that lets CUBLAS downcast; there isn’t really a special type, at least not for the user. Automatic downcasting from Float64 isn’t currently supported by CUBLAS, though.
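To make the “no special type” point concrete, a sketch of the explicit route (Float64 data has to be converted manually, since CUBLAS won’t down-cast it automatically):

```julia
using CUDA

# On the Julia side you just pick the element type; no wrapper type needed.
A64 = rand(Float64, 64, 64)
A16 = CuArray{Float16}(A64)   # explicit down-conversion on upload

C = A16 * A16   # dispatches to the Float16 CUBLAS GEMM wrapper
```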