CUDA.jl: Unexpected `mapreduce` error: threads per block exceed GPU limit (640 > 512)

Hello everyone,

I am using mapreduce from CUDA.jl to compute the sum of function evaluations over an array of SVector{7,T} data.
The code looks roughly like this:

using CUDA, StaticArrays

function func(x1, x2, x3, x4, x5, x6, x7, x8, x9)
    # very complicated calculation
    out1 = ...
    out2 = ...
    out3 = ...
    return SVector{3, Float64}(out1, out2, out3)
end

data = CuArray(zeros(SVector{7, Float64}, 2048))

mapreduce(vec -> func(2.0, 3.0, vec...), +, data)

However, I get the following error:
ERROR: LoadError: Number of threads per block exceeds kernel limit (640 > 512).

I am running this on our lab-maintained cluster. Interestingly, the same solver worked perfectly about a month ago, but now it suddenly fails with this error.

Does anyone know what might be causing this issue? Thanks.

Are you on CUDA 13.0? Maybe try downgrading it?

I’m no expert, but my guess is that your kernel uses too many GPU registers, which limits the maximum number of threads per block.
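If you want to confirm that, CUDA.jl lets you compile a kernel without launching it and then inspect its resource usage. A minimal sketch on a toy kernel (`kernel_fn`, `out`, and `xs` are placeholders, not your actual code):

using CUDA

function kernel_fn(out, xs)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(xs)
        @inbounds out[i] = 2f0 * xs[i]
    end
    return nothing
end

xs  = CUDA.rand(Float32, 1024)
out = similar(xs)

k = @cuda launch=false kernel_fn(out, xs)
CUDA.registers(k)                   # registers used per thread
CUDA.maxthreads(k)                  # max threads per block for this compiled kernel
CUDA.launch_configuration(k.fun)    # occupancy-based (blocks, threads) suggestion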

A quick suggestion would be to replace your `mapreduce` call with AcceleratedKernels.jl's `mapreduce`, which lets you tune the number of threads per block via the `block_size` parameter.
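Roughly like this (a sketch assuming AcceleratedKernels.jl's `mapreduce` keywords; the `init` value is my assumption, so check the package docs for your version):

import AcceleratedKernels as AK
using CUDA, StaticArrays

data = CuArray(zeros(SVector{7, Float64}, 2048))

# block_size sets the threads per block; init is assumed to be required here
AK.mapreduce(vec -> func(2.0, 3.0, vec...), +, data;
             init = zero(SVector{3, Float64}),
             block_size = 256)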

This looks like JuliaGPU/CUDA.jl issue #2863 (Invalid kernel config generated by `mapreducedim!` with `SubArray` input and output). I’ll take a look.

I am using CUDA 12.4.

That’s what I suspected as well. In the end, I wrote my own kernel function. Thanks for pointing out AcceleratedKernels.jl. I wasn’t aware of it, and I’ll definitely take a look.
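For reference, the general shape of such a reduction looks roughly like this (a simplified sketch using a shared-memory block reduction plus a CPU pass over the per-block partial sums; not the exact kernel I used, and it assumes `func` and `data` from my original post):

using CUDA, StaticArrays

function block_sum_kernel(partial, f, data)
    T = eltype(partial)
    shared = CuDynamicSharedArray(T, blockDim().x)
    tid = threadIdx().x
    i   = (blockIdx().x - 1) * blockDim().x + tid

    # grid-stride loop: each thread accumulates a private partial sum
    acc = zero(T)
    stride = blockDim().x * gridDim().x
    while i <= length(data)
        @inbounds acc += f(data[i])
        i += stride
    end
    @inbounds shared[tid] = acc
    sync_threads()

    # tree reduction in shared memory (assumes a power-of-two block size)
    s = blockDim().x ÷ 2
    while s >= 1
        if tid <= s
            @inbounds shared[tid] += shared[tid + s]
        end
        sync_threads()
        s ÷= 2
    end

    # thread 1 writes this block's partial result
    if tid == 1
        @inbounds partial[blockIdx().x] = shared[1]
    end
    return nothing
end

threads = 256
blocks  = 32
partial = CUDA.fill(zero(SVector{3, Float64}), blocks)
shmem   = threads * sizeof(SVector{3, Float64})

@cuda threads=threads blocks=blocks shmem=shmem block_sum_kernel(
    partial, vec -> func(2.0, 3.0, vec...), data)

total = sum(Array(partial))   # finish the reduction on the CPU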

Thank you for taking a look. I saw the new commit. With the latest version (v5.9.0), mapreduce works for my case. However, I noticed it is significantly slower than before.

I also tested my own kernel, which uses `CUDA.reduce_block` at its core. The elapsed times were 780 μs on v5.8.3 and 19.5 ms on v5.9.0. Is this slowdown expected?
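For anyone wanting to reproduce this, a measurement along these lines should work (a sketch, not necessarily how I timed it):

using BenchmarkTools, CUDA

# synchronize so the full GPU reduction is included in the measured time
@btime CUDA.@sync mapreduce(v -> func(2.0, 3.0, v...), +, $data)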

Some slowdown was expected, but that’s much too large. Can you open an issue?