So first of all, I should mention that when I made the PR I only had maybe a dozen hours of GPU programming experience, so this code has lots of problems and is not an example of how to do things. Furthermore, it was meant as a slow baseline that should be improved over the course of the tutorial. Sadly, I have not yet found time to update the PR.
-
To improve this code, one has to use shared memory. Hopefully I will find time to update it.
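To give an idea of what "use shared memory" means here, below is a CPU model of the classic shared-memory tree reduction one would write inside a CUDA kernel. The names (`block_reduce`, `shared`) are illustrative, not from the PR; on the GPU, `shared` would live in on-chip shared memory and each halving pass would be separated by a barrier.

```python
# CPU model of a per-block shared-memory tree reduction.
# Assumes the block size is a power of two, as the real kernel usually does.

def block_reduce(block_values):
    """Tree-reduce one block's values, mimicking the shared-memory pattern."""
    shared = list(block_values)      # stands in for the shared-memory buffer
    stride = len(shared) // 2
    while stride > 0:                # each pass halves the active threads
        for tid in range(stride):    # on the GPU all tids run in parallel
            shared[tid] += shared[tid + stride]
        # a barrier (sync_threads) would go here on the GPU
        stride //= 2
    return shared[0]

print(block_reduce(list(range(8))))  # -> 28
```

Each block would reduce its chunk this way in fast on-chip memory, and only one value per block would touch global memory.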
-
These numbers are in no way ideal. A general recommendation I remember from the CUDA docs is that the number of threads should satisfy `32 <= threads <= 1024` (ideally a multiple of the warp size, 32), and the number of blocks should be in the thousands.
I think the number of blocks is indeed too high in the example. `numthreads * numblocks > length(out)` is not required: each thread can loop over multiple elements in the code.
-
I guess `atomic_add` is not supported for complex numbers.
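A common workaround when `atomic_add` lacks complex support is to view the complex buffer as interleaved reals and atomically add the real and imaginary parts separately. The sketch below uses NumPy on the CPU to stand in for GPU memory; the function name and the approach are illustrative, not from the PR.

```python
# Workaround sketch: atomic complex add via two real adds.
import numpy as np

out = np.zeros(3, dtype=np.complex128)
out_as_real = out.view(np.float64)   # shares memory: [re0, im0, re1, im1, ...]

def atomic_add_complex(idx, value):
    # on the GPU, each of these would be an atomic_add on a real-valued view
    out_as_real[2 * idx] += value.real
    out_as_real[2 * idx + 1] += value.imag

atomic_add_complex(1, 2.0 + 3.0j)
print(out[1])  # -> (2+3j)
```

The two real additions are not atomic with respect to each other, which is fine for accumulation since real and imaginary parts never mix.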