However, I get the following error: ERROR: LoadError: Number of threads per block exceeds kernel limit (640 > 512).
I am running this on our lab-maintained cluster. Interestingly, the same solver worked perfectly about a month ago, but now it suddenly fails with this error.
Does anyone know what might be causing this issue? Thanks.
Without being an expert, I’d guess your kernel uses too many GPU registers per thread, which caps the maximum number of threads per block it can be launched with.
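If it helps, you can check the register pressure and the resulting per-kernel thread limit from Julia before launching. Below is a rough sketch; `my_kernel!` is just a placeholder standing in for whatever your solver actually launches:

```julia
using CUDA

# Placeholder kernel: stand-in for your solver's real kernel.
function my_kernel!(out, x)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        @inbounds out[i] = 2 * x[i]
    end
    return nothing
end

x = CUDA.rand(Float32, 1024)
out = similar(x)

# Compile without launching, then inspect resource usage.
kernel = @cuda launch=false my_kernel!(out, x)
@show CUDA.registers(kernel)    # registers per thread
@show CUDA.maxthreads(kernel)   # max threads per block for this kernel
config = CUDA.launch_configuration(kernel.fun)
@show config.threads config.blocks
```

If `CUDA.maxthreads` comes back below 640 for the kernel in question, that would confirm the register-pressure theory.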
A quick suggestion would be to replace your mapreduce call with AcceleratedKernels.jl’s mapreduce, which lets you tune the number of threads per block via the block_size keyword.
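As a sketch of what that looks like (assuming AcceleratedKernels.jl’s keyword-based API; the mapped function and reduction here are arbitrary examples):

```julia
using CUDA
import AcceleratedKernels as AK

x = CUDA.rand(Float32, 1_000_000)

# block_size caps the threads per block; 256 stays well under a
# 512-thread kernel limit. abs2/+ are placeholder map/reduce ops.
s = AK.mapreduce(abs2, +, x; init=0.0f0, block_size=256)
```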
That’s what I suspected as well. In the end, I wrote my own kernel function. Thanks for pointing out AcceleratedKernels.jl; I wasn’t aware of it and will definitely take a look.
Thank you for taking a look. I saw the new commit. With the latest version (v5.9.0), mapreduce works for my case. However, I noticed it is significantly slower than before.
I also tested my own kernel, which uses CUDA.reduce_block at its core. The elapsed times were 780 μs on v5.8.3 and 19.5 ms on v5.9.0. Is this slowdown expected?
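For reference, timings like these can be compared on each CUDA.jl version with a warm-up call followed by a synchronized measurement, so the full GPU execution is timed rather than just the asynchronous launch (the array size and abs2/+ operations here are placeholders, not the original solver's workload):

```julia
using CUDA

x = CUDA.rand(Float32, 1_000_000)

# Warm up once to exclude compilation, then time with synchronization.
mapreduce(abs2, +, x)
t = CUDA.@elapsed mapreduce(abs2, +, x)
println("mapreduce: ", round(t * 1e6; digits = 1), " μs")
```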