CUDA.jl: Unexpected `mapreduce` error: threads per block exceed GPU limit (640 > 512)

Hello everyone,

I am using mapreduce from CUDA.jl to sum the results of a function evaluated on each element of a CuArray of SVector{7,T} data.
The code looks roughly like this:

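using CUDA, StaticArrays

# Takes two scalar parameters plus the seven components of one data point.
function func(x1, x2, x3, x4, x5, x6, x7, x8, x9)
    # very complicated calculation
    out1 = ...
    out2 = ...
    out3 = ...
    return SVector{3, Float64}(out1, out2, out3)
end

# 2048 seven-component points stored on the GPU.
data = CuArray(zeros(SVector{7, Float64}, 2048))

# Evaluate func on every point and sum the resulting 3-vectors.
mapreduce(vec -> func(2.0, 3.0, vec...), +, data)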

However, I get the following error:
ERROR: LoadError: Number of threads per block exceeds kernel limit (640 > 512).

I am running this on our lab-maintained cluster. Interestingly, the same solver worked perfectly about a month ago, but now it suddenly fails with this error.

Does anyone know what might be causing this issue? Thanks.

Are you on CUDA 13.0? If so, maybe try downgrading it.
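
A rough sketch of how one might check which CUDA version CUDA.jl is using and pin an older runtime. The target version `v"12.8"` is only an illustrative assumption; pick whichever pre-13.0 release your driver and CUDA.jl version support. This may not apply if the cluster forces a local toolkit.

```julia
using CUDA

# Print the CUDA runtime and driver versions, package versions, and visible devices.
CUDA.versioninfo()

# The same version information, available programmatically.
@show CUDA.runtime_version()
@show CUDA.driver_version()

# Pin the CUDA runtime that CUDA.jl uses.
# Assumption: v"12.8" is just an example of a pre-13.0 release.
# The setting is stored as a preference in the active project and
# only takes effect after restarting Julia.
CUDA.set_runtime_version!(v"12.8")
```

After restarting Julia, running `CUDA.versioninfo()` again should show whether the pinned runtime was picked up, and the `mapreduce` call can then be retried.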