Yeah, don’t do that
A quick fix is to add CUDA.@atomic in front of those additions. That will make it slower, though, since atomic updates to the same memory location are serialized. A better solution is to compute interim sums per thread, aggregate those across the block, and then perform a single atomic addition per block at the grid level; but that’s of course much more invasive. Depending on the exact characteristics of your kernel, CUDA.@atomic might perform well enough for you.
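To make the two options concrete, here is a minimal sketch of both on a plain sum. The kernel names `sum_atomic!` and `sum_blockwise!`, the one-element accumulator `acc`, the `Float32` element type, and the block size of 256 are all assumptions made for illustration, not taken from your code, so adapt them to your actual kernel.

```julia
using CUDA

# Quick fix: every thread adds its element atomically to one accumulator.
# Correct, but all threads contend for the same memory location.
function sum_atomic!(acc, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        CUDA.@atomic acc[1] += x[i]
    end
    return nothing
end

# More invasive: reduce within the block in shared memory first,
# then issue a single atomic addition per block.
# Assumes the kernel is launched with exactly 256 threads per block
# (a power of two, matching the shared array size below).
function sum_blockwise!(acc, x)
    tid = threadIdx().x
    i = (blockIdx().x - 1) * blockDim().x + tid

    shared = CuStaticSharedArray(Float32, 256)
    shared[tid] = i <= length(x) ? x[i] : 0f0
    sync_threads()

    # Tree reduction: halve the number of active threads each step.
    s = blockDim().x ÷ 2
    while s > 0
        if tid <= s
            shared[tid] += shared[tid + s]
        end
        sync_threads()
        s ÷= 2
    end

    # Only the first thread of each block touches global memory.
    if tid == 1
        CUDA.@atomic acc[1] += shared[1]
    end
    return nothing
end

x = CUDA.rand(Float32, 1_000_000)
threads = 256
blocks = cld(length(x), threads)

acc = CUDA.zeros(Float32, 1)
@cuda threads=threads blocks=blocks sum_blockwise!(acc, x)
```

You can swap `sum_atomic!` into the same launch (with a freshly zeroed `acc`) and compare both against `sum(x)`; expect agreement only up to floating-point rounding, since the order of additions differs. Timing both with `CUDA.@elapsed` on your real data should tell you whether the simple atomic version is already fast enough.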