Praise: CUDA.allowscalar(false) is great

A single line makes me substantially more productive at writing GPU code, and I’d like to highlight how much I love this pattern. I’ve had vague ideas about how to express this kind of concept before, but `CUDA.allowscalar(false)` is just a really practical and useful workhorse. It’s a really nice angle on the “hitting a fallback that silently kills your performance” problem.
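For anyone who hasn’t hit it: by default, scalar indexing into a `CuArray` in an interactive session only warns, so the slow fallback is easy to miss. A minimal sketch of what flipping the switch changes:

```julia
using CUDA

CUDA.allowscalar(false)    # turn silent scalar fallbacks into hard errors

x = CUDA.rand(Float32, 1_000)

# x[1]                     # would now throw "Scalar indexing is disallowed"
                           # (by default it only warns in the REPL)
sum(x)                     # array-level operations still run on the GPU
```

My workflow with it is roughly: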

  1. Write the simplest possible implementation using scalar indexing.
  2. Write some tests to make sure it works on the CPU, and run it in the REPL.
  3. Call `CUDA.allowscalar(false)`, then incrementally replace the scalar indexing with broadcasting or kernels until it stops erroring (see the sketch after this list).
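
Here’s roughly what that looks like in practice. This is just a minimal sketch of the pattern; `saxpy_naive!` and `saxpy!` are made-up names for illustration:

```julia
using CUDA, Test

# Step 1: simplest possible implementation, scalar indexing everywhere.
function saxpy_naive!(y, a, x)
    for i in eachindex(x, y)
        y[i] = a * x[i] + y[i]
    end
    return y
end

# Step 2: check it on the CPU first.
x, y = rand(Float32, 4), rand(Float32, 4)
@test saxpy_naive!(copy(y), 2f0, x) ≈ 2f0 .* x .+ y

# Step 3: forbid scalar fallbacks, then replace the loop with a broadcast
# (or a kernel) until the GPU call stops erroring.
CUDA.allowscalar(false)
saxpy!(y, a, x) = (y .= a .* x .+ y)

xd, yd = CuArray(x), CuArray(y)
# saxpy_naive!(yd, 2f0, xd)           # would throw: scalar indexing is disallowed
@test Array(saxpy!(yd, 2f0, xd)) ≈ 2f0 .* x .+ y
```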

I also find this pattern (incrementally change the code until the errors go away while the tests keep passing) productive for working with AI agents. CUDA.jl hits performance-killing fallbacks loudly rather than silently, and that ends up accelerating the development of fast code. I really love it.

(DispatchDoctor.jl enables a similar workflow for type inference.)
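
If I understand its default mode correctly, DispatchDoctor.jl’s `@stable` macro gives the same loud-failure behaviour for type inference: calls whose return type isn’t concretely inferred throw instead of silently falling back to dynamic dispatch. A minimal sketch (the function names are made up):

```julia
using DispatchDoctor: @stable

# One branch returns Float64, the other Int, so the return type is a Union.
@stable halve_positive(x) = x > 0 ? x / 2 : 0

# halve_positive(3.0)    # would throw a type-instability error instead of boxing

# Making both branches agree on the type makes the error go away:
@stable halve_positive2(x) = x > 0 ? x / 2 : zero(x) / 2

halve_positive2(3.0)     # inferred as Float64, runs fine
```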
