What are the "limitations" of CUDA.jl relative to CPU code and where are they rooted

Recently I got my hands on an entry-level Nvidia GPU, so I started reading through the CUDA.jl docs and some related (and possibly outdated) articles. I noticed that GPU code can look very different from CPU code, and that it comes with some restrictions, relatively speaking. I’m asking for a more experienced perspective on those restrictions and where they’re rooted: Julia, CUDA.jl, CUDA / CUDA C (which seems to be the main implementation language), the GPU itself, or elsewhere.

I think I’ve figured out where some of the restrictions come from, but I don’t know C/C++, so I get stuck whenever CUDA C comes up. Here’s the list of what I have so far; feel free to correct me in the thread and I’ll try to edit it:

  • poor Float64 performance: GPU
  • no recursion: CUDA.jl, though CUDA’s own support is limited by the GPU
  • kernel must return nothing: CUDA
  • no kernel varargs: CUDA, and so far I have not seen CUDA.jl kernels with Vararg arguments
  • no strings: CUDA (aside: a workaround is using arrays of Char, but I haven’t seen CUDA.jl examples of that so far)
  • must have type-inferred code: CUDA C is statically typed
  • no garbage collection on device: CUDA C has manual memory management
  • kernel cannot allocate, and device arrays may only hold isbits types: CUDA C has no garbage collection, and Julia has no manual deallocation, much less a way to manage device data that lives independently of a CuArray
  • no try-catch-finally in a kernel: CUDA C does not support exception handling on the device (v11.5.1 docs, Programming Guide I.4.7)
  • no scalar indexing of a CuArray: even a scalar getindex in host code must move the scalar to the CPU, and CUDA.jl does not launch a kernel for a scalar setindex like a[1] += 1 because it’s not worth it
  • calls to CPU-only runtime library: the GPU can’t have a version of every low-level CPU function Julia has
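To make a few of the items above concrete, here is a minimal sketch of a hand-written CUDA.jl kernel (the function name `increment!` and the launch configuration are my own choices, not from any doc): the kernel ends with `return nothing`, and the device array holds an isbits element type.

```julia
using CUDA

# Minimal kernel: increments each element in place.
# Kernels may not return a value, hence the trailing `return nothing`.
function increment!(a)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(a)
        @inbounds a[i] += 1
    end
    return nothing
end

a = CUDA.zeros(Float32, 1024)   # Float32 is isbits, so it can live in a CuArray
@cuda threads=256 blocks=4 increment!(a)
```

Note the manual index computation from `threadIdx`/`blockIdx`/`blockDim`, which is where GPU kernels most visibly diverge from ordinary CPU loops.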
  • no exceptions

This depends: throwing an exception works on the device and will cause the kernel to be terminated. But you can’t use try ... catch on the device. Backtrace printing is quite costly, so without -g2 CUDA.jl will tell you that an exception occurred but not precisely where. (It’s a compromise.)
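As a sketch of what "throwing works, catching doesn't" looks like in practice (the kernel name is hypothetical; I'm assuming default bounds checking, i.e. no @inbounds):

```julia
using CUDA

# An out-of-bounds write throws a device-side exception,
# which terminates the kernel; there is no way to catch it on device.
function bad_kernel!(a)
    a[0] = 1f0   # index 0 is out of bounds in Julia
    return nothing
end

a = CUDA.zeros(Float32, 4)
@cuda bad_kernel!(a)
# The failure surfaces on the host at the next synchronization point;
# without -g2 the report says an exception occurred but not where.
synchronize()
```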

  • scalar indexing a CuArray moves datum to CPU (even when in-place a[1] += 1; surely kernels must be capable of that if a .+= 2 works)

That’s a limitation of the programming model. We could launch a kernel that does a[1] += 1, but that would (probably) be costlier than moving the memory back and forth.
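The contrast can be sketched like this: broadcasting compiles to a single GPU kernel, while scalar indexing from host code is disallowed by default and must be opted into explicitly.

```julia
using CUDA

a = CUDA.ones(Float32, 4)

a .+= 2                        # broadcast: runs as one kernel on the GPU

# a[1] += 1 from host code errors by default; opting in moves the
# single value to the CPU, updates it, and writes it back:
CUDA.@allowscalar a[1] += 1
```

`CUDA.@allowscalar` is meant for debugging and one-off inspection; in hot code the scalar round trips dominate, which is exactly why CUDA.jl disallows them by default.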

  • calls to CPU-only runtime library (what does this even mean)

Julia is implemented with a runtime library that contains the GC, interaction with the OS, part of the task scheduler, etc. For some of these it is reasonable to define a GPU alternative (and we do); others make no sense on the GPU. Generally, the GPU can’t efficiently call CPU functions due to the underlying execution model.
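One example of such a GPU alternative: `println` depends on the CPU runtime and can't be called from a kernel, but CUDA.jl provides `@cuprintln`, built on the device-side printf facility (the kernel name here is my own):

```julia
using CUDA

# `println` is host-only; `@cuprintln` is its device-side counterpart.
function hello()
    @cuprintln("hello from thread ", threadIdx().x)
    return nothing
end

@cuda threads=2 hello()
synchronize()   # flush device output before the host program moves on
```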
