What are the "limitations" of CUDA.jl relative to CPU code and where are they rooted

Recently I got my hands on an entry-level Nvidia GPU, so I started reading through the CUDA.jl docs and some related (and possibly outdated) articles. I noticed that GPU code can look very different from CPU code, and that it comes with some restrictions, relatively speaking. I’m asking for a more experienced perspective on those restrictions and where they’re rooted: Julia, CUDA.jl, CUDA / CUDA C (which seems to be the main implementation language), the GPU itself, or elsewhere.

I think I’ve figured out where some of the restrictions come from, but I don’t know C/C++, so I get stuck whenever CUDA C comes up. Here’s the list of what I have so far; feel free to correct me in the thread and I’ll try to edit it:

  • poor Float64 performance: GPU
  • no recursion: CUDA.jl, though CUDA’s own support is limited by the GPU
  • kernel must return nothing: CUDA
  • no kernel varargs: CUDA, and so far I have not seen CUDA.jl kernels with Vararg arguments
  • no strings: CUDA (aside: a workaround is using arrays of Char, but I haven’t seen CUDA.jl examples of that so far)
  • must have type-inferred code: CUDA C is statically typed
  • no garbage collection on device: CUDA C has manual memory management
  • kernel cannot allocate, and device arrays may only hold isbits types: CUDA C has no garbage collection, and Julia has no manual deallocation, much less a way to manage device data that lives independently of a CuArray
  • no try-catch-finally in a kernel: CUDA C does not support exception handling on the device (v11.5.1 docs, Programming Guide I.4.7)
  • no scalar indexing of a CuArray: even a scalar getindex in host code must move the scalar to the CPU, and CUDA.jl does not launch a kernel for a scalar setindex like a[1] += 1 because it’s not worth it
  • calls to CPU-only runtime library: the GPU can’t have a version of every low-level CPU function Julia has
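To make a few of the items above concrete, here is a minimal sketch of a hand-written CUDA.jl kernel (the function name `increment!` and the launch configuration are my own choices, not from any doc): the kernel ends with `return nothing`, and the device array holds an isbits element type.

```julia
using CUDA

# Minimal kernel: increments each element in place.
# Kernels may not return a value, hence the trailing `return nothing`.
function increment!(a)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(a)
        @inbounds a[i] += 1
    end
    return nothing
end

a = CUDA.zeros(Float32, 1024)   # Float32 is isbits, so it can live in a CuArray
@cuda threads=256 blocks=4 increment!(a)
```

Note the manual index computation from `threadIdx`/`blockIdx`/`blockDim`, which is where GPU kernels most visibly diverge from ordinary CPU loops.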
  • no exceptions

This depends: throwing an exception works on the device and will cause the kernel to be terminated. But you can’t use try ... catch on the device. Backtrace printing is quite costly, so without -g2 CUDA.jl will tell you that an exception occurred but not precisely where. (It’s a compromise.)
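As a sketch of what "throwing works, catching doesn't" looks like in practice (the kernel name is hypothetical; I'm assuming default bounds checking, i.e. no @inbounds):

```julia
using CUDA

# An out-of-bounds write throws a device-side exception,
# which terminates the kernel; there is no way to catch it on device.
function bad_kernel!(a)
    a[0] = 1f0   # index 0 is out of bounds in Julia
    return nothing
end

a = CUDA.zeros(Float32, 4)
@cuda bad_kernel!(a)
# The failure surfaces on the host at the next synchronization point;
# without -g2 the report says an exception occurred but not where.
synchronize()
```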

  • scalar indexing a CuArray moves datum to CPU (even when in-place a[1] += 1; surely kernels must be capable of that if a .+= 2 works)

That’s a limitation of the programming model. We could launch a kernel that does a[1] += 1, but that would (probably) be costlier than moving the memory back and forth.
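The contrast can be sketched like this: broadcasting compiles to a single GPU kernel, while scalar indexing from host code is disallowed by default and must be opted into explicitly.

```julia
using CUDA

a = CUDA.ones(Float32, 4)

a .+= 2                        # broadcast: runs as one kernel on the GPU

# a[1] += 1 from host code errors by default; opting in moves the
# single value to the CPU, updates it, and writes it back:
CUDA.@allowscalar a[1] += 1
```

`CUDA.@allowscalar` is meant for debugging and one-off inspection; in hot code the scalar round trips dominate, which is exactly why CUDA.jl disallows them by default.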

  • calls to CPU-only runtime library (what does this even mean)

Julia is implemented with a runtime library that contains the GC, interaction with the OS, part of the task scheduler, etc. For some of these it is reasonable to define a GPU alternative (and we do); others make no sense on the GPU. Generally, the GPU can’t efficiently call CPU functions due to the underlying execution model.
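One example of such a GPU alternative: `println` depends on the CPU runtime and can't be called from a kernel, but CUDA.jl provides `@cuprintln`, built on the device-side printf facility (the kernel name here is my own):

```julia
using CUDA

# `println` is host-only; `@cuprintln` is its device-side counterpart.
function hello()
    @cuprintln("hello from thread ", threadIdx().x)
    return nothing
end

@cuda threads=2 hello()
synchronize()   # flush device output before the host program moves on
```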
