CUDA.jl: Unexpected `mapreduce` error: threads per block exceed GPU limit (640 > 512)

Hello everyone,

I am using mapreduce from CUDA.jl to compute the sum of function evaluations over an array of SVector{7,T} data.
The code looks roughly like this:

using CUDA, StaticArrays

function func(x1, x2, x3, x4, x5, x6, x7, x8, x9)
    # very complicated calculation
    out1 = ...
    out2 = ...
    out3 = ...
    return SVector{3, Float64}(out1, out2, out3)
end

data = CuArray(zeros(SVector{7, Float64}, 2048))

mapreduce(vec -> func(2.0, 3.0, vec...), +, data)

However, I get the following error:
ERROR: LoadError: Number of threads per block exceeds kernel limit (640 > 512).

I am running this on our lab-maintained cluster. Interestingly, the same solver worked perfectly about a month ago, but now it suddenly fails with this error.

Does anyone know what might be causing this issue? Thanks.

Are you on CUDA 13.0? Maybe try downgrading it?

I’m no expert, but my guess is that your kernel uses too many GPU registers, which limits the maximum number of threads per block.
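If you want to confirm that, CUDA.jl lets you compile a kernel without launching it and then inspect its resource usage. A minimal sketch on a toy kernel (`kernel_fn`, `out`, and `xs` are placeholders, not your actual code):

using CUDA

function kernel_fn(out, xs)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(xs)
        @inbounds out[i] = 2f0 * xs[i]
    end
    return nothing
end

xs  = CUDA.rand(Float32, 1024)
out = similar(xs)

k = @cuda launch=false kernel_fn(out, xs)
CUDA.registers(k)                   # registers used per thread
CUDA.maxthreads(k)                  # max threads per block for this compiled kernel
CUDA.launch_configuration(k.fun)    # occupancy-based (blocks, threads) suggestion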

A quick suggestion would be to replace your `mapreduce` call with AcceleratedKernels.jl's `mapreduce`, which lets you tune the number of threads per block via the `block_size` parameter.
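Roughly like this (a sketch assuming AcceleratedKernels.jl's `mapreduce` keywords; the `init` value is my assumption, so check the package docs for your version):

import AcceleratedKernels as AK
using CUDA, StaticArrays

data = CuArray(zeros(SVector{7, Float64}, 2048))

# block_size sets the threads per block; init is assumed to be required here
AK.mapreduce(vec -> func(2.0, 3.0, vec...), +, data;
             init = zero(SVector{3, Float64}),
             block_size = 256)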

This looks like JuliaGPU/CUDA.jl issue #2863 (Invalid kernel config generated by `mapreducedim!` with `SubArray` input and output). I’ll take a look.

I am using CUDA 12.4.

That’s what I suspected as well. In the end, I wrote my own kernel function. Thanks for pointing out AcceleratedKernels.jl. I wasn’t aware of it, and I’ll definitely take a look.
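For reference, the general shape of such a reduction looks roughly like this (a simplified sketch using a shared-memory block reduction plus a CPU pass over the per-block partial sums; not the exact kernel I used, and it assumes `func` and `data` from my original post):

using CUDA, StaticArrays

function block_sum_kernel(partial, f, data)
    T = eltype(partial)
    shared = CuDynamicSharedArray(T, blockDim().x)
    tid = threadIdx().x
    i   = (blockIdx().x - 1) * blockDim().x + tid

    # grid-stride loop: each thread accumulates a private partial sum
    acc = zero(T)
    stride = blockDim().x * gridDim().x
    while i <= length(data)
        @inbounds acc += f(data[i])
        i += stride
    end
    @inbounds shared[tid] = acc
    sync_threads()

    # tree reduction in shared memory (assumes a power-of-two block size)
    s = blockDim().x ÷ 2
    while s >= 1
        if tid <= s
            @inbounds shared[tid] += shared[tid + s]
        end
        sync_threads()
        s ÷= 2
    end

    # thread 1 writes this block's partial result
    if tid == 1
        @inbounds partial[blockIdx().x] = shared[1]
    end
    return nothing
end

threads = 256
blocks  = 32
partial = CUDA.fill(zero(SVector{3, Float64}), blocks)
shmem   = threads * sizeof(SVector{3, Float64})

@cuda threads=threads blocks=blocks shmem=shmem block_sum_kernel(
    partial, vec -> func(2.0, 3.0, vec...), data)

total = sum(Array(partial))   # finish the reduction on the CPU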

Thank you for taking a look. I saw the new commit. With the latest version (v5.9.0), mapreduce works for my case. However, I noticed it is significantly slower than before.

I also tested my own kernel, which uses `CUDA.reduce_block` at its core. The elapsed times were 780 μs on v5.8.3 and 19.5 ms on v5.9.0. Is this slowdown expected?
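For anyone wanting to reproduce this, a measurement along these lines should work (a sketch, not necessarily how I timed it):

using BenchmarkTools, CUDA

# synchronize so the full GPU reduction is included in the measured time
@btime CUDA.@sync mapreduce(v -> func(2.0, 3.0, v...), +, $data)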

Some slowdown was expected, but that’s much too large. Can you open an issue?