Hello everyone,
I am using mapreduce
from CUDA.jl to compute the sum of function evaluations over an array of SVector{7,T}
data.
The code looks roughly like this:
function func(x1, x2, x3, x4, x5, x6, x7, x8, x9)
# very complicated calculation
out1 = ...
out2 = ...
out3 = ...
return SVector{3, Float64}(out1, out2, out3)
end
data = CuArray(zeros(SVector{7, Float64}, 2048))
mapreduce(vec -> func(2.0, 3.0, vec...), +, data)
However, I get the following error:
ERROR: LoadError: Number of threads per block exceeds kernel limit (640 > 512).
I am running this on our lab-maintained cluster. Interestingly, the same solver worked perfectly about a month ago, but now it suddenly fails with this error.
Does anyone know what might be causing this issue? Thanks.