CUDA sum kernels, threads and blocks, complex values

So, first of all, I should mention that when I made the PR I only had maybe a dozen hours of GPU programming experience. So this code has lots of problems and is not an example of how to do things. Furthermore, it was meant as a slow baseline that should be improved over the course of the tutorial. Sadly, I have not yet found time to update the PR.

  1. To improve this code, one has to use shared memory. Hopefully I will find time to update it.
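To illustrate what I mean (a CPU sketch in Python, not the actual PR code): each block would first stage its elements in shared memory and then combine them with a tree reduction, halving the number of active threads each step.

```python
def block_reduce(shmem):
    """CPU sketch of the classic shared-memory tree reduction.

    `shmem` stands in for one block's shared-memory buffer; its length
    is the block size and is assumed to be a power of two. On a real
    GPU, each while-iteration would end with a barrier (__syncthreads())
    before the stride is halved.
    """
    stride = len(shmem) // 2
    while stride > 0:
        for t in range(stride):          # "thread" t adds its partner's value
            shmem[t] += shmem[t + stride]
        stride //= 2                     # half as many threads stay active
    return shmem[0]                      # thread 0 ends up with the block sum

print(block_reduce(list(range(8))))      # → 28
```

Each block would then write its single result out, and a second (much smaller) launch or an atomic add combines the per-block sums.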

  2. These numbers are in no way ideal. A general recommendation from the CUDA docs, as I remember it, is that the thread count should satisfy 32 <= threads <= 1024 (ideally a multiple of the warp size, 32) and the block count should be in the thousands.
    I think the block count is indeed too high in the example. numthreads * numblocks > length(out) is not required: each thread can loop over multiple elements in the code (a grid-stride loop).
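The grid-stride idea can be sketched on the CPU like this (plain Python standing in for the kernel; `num_blocks` and `threads_per_block` are made-up launch parameters, not the ones from the PR):

```python
def grid_stride_partial_sums(data, num_blocks, threads_per_block):
    # Each simulated thread starts at its global index and jumps ahead
    # by the total number of threads, so any launch configuration can
    # cover an input of any length.
    total_threads = num_blocks * threads_per_block
    partial = [0] * total_threads
    for tid in range(total_threads):
        for i in range(tid, len(data), total_threads):
            partial[tid] += data[i]
    return partial  # one partial sum per thread; reduce these afterwards

data = list(range(1000))
print(sum(grid_stride_partial_sums(data, 4, 32)) == sum(data))  # → True
```

This is why the grid size can be chosen for occupancy rather than to match the array length.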

  3. I guess atomic_add is not supported for complex numbers; hardware atomics generally exist only for plain integer and floating-point types.
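A common workaround (sketched here in plain Python; on the GPU each `+=` below would be one real atomic add on a float buffer) is to store the complex accumulator as two real slots and update them independently:

```python
# The complex accumulator is stored as two real slots; each slot can be
# updated with an ordinary real atomic add (modelled here by a plain +=).
acc = [0.0, 0.0]   # acc[0] = real part, acc[1] = imaginary part

def atomic_add_complex(acc, z):
    # On the GPU these would be two independent real atomic adds;
    # no complex atomic is needed.
    acc[0] += z.real
    acc[1] += z.imag

for z in [1 + 2j, 3 - 1j, -0.5 + 0.5j]:
    atomic_add_complex(acc, z)

print(complex(acc[0], acc[1]))  # → (3.5+1.5j)
```

The two halves are not updated atomically *together*, so a concurrent reader could see a torn intermediate value, but for a final sum that only gets read after the kernel finishes this does not matter.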
