Recently I got my hands on an entry-level Nvidia GPU, so I started reading through the CUDA.jl docs and some related (and possibly outdated) articles. I definitely noticed that GPU code can look very different from CPU code, and that it comes with some restrictions, relatively speaking. I’m asking for a more experienced perspective on these restrictions and where they’re rooted: Julia, CUDA.jl, CUDA / CUDA C (which seems to be the main implementation language), the GPU hardware, or elsewhere.
I think I’ve figured out where some restrictions come from, but I don’t know C/C++, so I get stuck whenever CUDA C shows up. Here’s the list I’ve put together, with some rough code sketches after it; feel free to correct me in the thread and I’ll try to make edits:
- poor Float64 performance: GPU (consumer GeForce cards have far fewer FP64 units than FP32 units; sketch 1 below)
- no recursion: CUDA.jl, though CUDA’s own recursion support is limited by the GPU
- kernel must return `nothing`: CUDA (sketch 2 below)
- no kernel varargs: CUDA, and so far I have not seen CUDA.jl kernels with varargs
- no strings: CUDA (aside: a workaround is using arrays of `Char`, but I haven’t seen CUDA.jl examples of that so far; sketch 3 below)
- must have type-inferred code: CUDA C is statically typed, and the GPU compiler has no runtime to fall back to for dynamic dispatch (sketch 4 below)
- no garbage collection on device: CUDA C has manual memory management
- kernel cannot allocate, and only isbits types in device arrays: CUDA C has no garbage collection, and Julia has no manual deallocation, let alone on the device, so there’s no way to manage data that lives independently of the CuArray (sketch 5 below)
- no try-catch-finally in kernel: CUDA C does not support exception handling on the device (v11.5.1 docs, Programming Guide I.4.7)
- no scalar indexing of CuArray: a scalar `getindex` in host code must copy the scalar to the CPU, and CUDA.jl does not launch a kernel for a scalar `setindex!` (e.g. `a[1] += 1`) because it’s not worth the launch overhead (sketch 6 below)
- calls to CPU-only runtime library: the GPU can’t have a version of every low-level CPU function Julia has
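
Sketch 1 (Float64 performance): a rough illustration, not a proper benchmark; the exact FP64:FP32 ratio depends on the card, and memory-bound operations will mostly show the doubled memory traffic rather than the scarce FP64 units.

```julia
using CUDA

a32 = CUDA.rand(Float32, 10^7)
a64 = CUDA.rand(Float64, 10^7)

# On GeForce cards the Float64 version is slower: fewer FP64 units
# plus twice the bytes moved per element.
CUDA.@time sum(a32 .^ 2)
CUDA.@time sum(a64 .^ 2)
```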
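
Sketch 2 (kernel must return `nothing`): a minimal hand-written kernel of the kind the docs show; `add_one!` and the launch configuration are just placeholders I picked.

```julia
using CUDA

function add_one!(a)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(a)           # guard: the grid may be larger than the array
        @inbounds a[i] += 1f0
    end
    return nothing              # returning any value is a compile-time error
end

a = CUDA.zeros(Float32, 1024)
@cuda threads=256 blocks=4 add_one!(a)
```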
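
Sketch 3 (arrays of `Char` instead of strings): assuming per-character processing is all that’s needed; plain `Char` arithmetic should compile in a broadcast kernel since it’s just integer operations.

```julia
using CUDA

msg = CuArray(collect("hello"))   # Char is isbits, so this is allowed
shifted = msg .+ 1                # Char arithmetic in a broadcast kernel
String(Array(shifted))            # copy back to the host => "ifmmp"
```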
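
Sketch 4 (type-inferred code): CUDA.jl ships `@device_code_warntype` to inspect what the GPU compiler infers, analogous to `@code_warntype` on the CPU; `scale!` is a made-up example.

```julia
using CUDA

function scale!(a, s)
    i = threadIdx().x
    @inbounds a[i] *= s
    return nothing
end

a = CUDA.ones(Float32, 32)
# If anything here inferred as `Any`, kernel compilation would fail
# with a dynamic-function-invocation error instead of falling back
# to runtime dispatch like CPU Julia does.
@device_code_warntype @cuda threads=32 scale!(a, 2f0)
```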
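
Sketch 5 (isbits elements): why a plain immutable struct is fine in a CuArray but a `String` is not; `Point` is a made-up type.

```julia
using CUDA

struct Point        # immutable with isbits fields => isbits itself
    x::Float32
    y::Float32
end

isbitstype(Point)   # true: fixed size, no pointers, safe to copy to the GPU
isbitstype(String)  # false: owns GC-managed heap memory the device can't follow

pts = CuArray([Point(1f0, 2f0), Point(3f0, 4f0)])  # works
# CuArray(["a", "b"])   # errors: CuArray requires inline-stored element types
```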
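
Sketch 6 (scalar indexing): how CUDA.jl reacts to the `a[1] += 1` case; whether scalar indexing warns or errors depends on the CUDA.jl version and whether the session is interactive.

```julia
using CUDA

a = CUDA.ones(Float32, 10)

# a[1] += 1               # scalar getindex/setindex!: warns or errors
a .+= 1                   # fine: one broadcast kernel updates the whole array
CUDA.@allowscalar a[1] += 1   # explicit opt-in when one element really is needed
```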