Error when implementing multidimensional kernel


I was implementing a very simple example of kernel programming with CUDA.jl. As an example, I took the addition of two matrices. The used code is the following:

function _add_kernel!(C, A, B, m, n)
    index_i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    index_j = (blockIdx().y - 1) * blockDim().y + threadIdx().y

    stride_i = blockDim().x * gridDim().x
    stride_j = blockDim().y * gridDim().y

    # grid-stride loops over both matrix dimensions
    for i in index_i:stride_i:m
        for j in index_j:stride_j:n
            @inbounds C[i,j] = A[i,j] + B[i,j]
        end
    end
    return nothing
end

using CUDA

# dimension of the matrices
nA = 20
A = CUDA.rand(Float32, nA, nA)
B = CUDA.rand(Float32, nA, nA);

TA = eltype(A)
TB = eltype(B)

m, n = size(A)

T = promote_type(TA, TB)
C = similar(A, T, m, n)

kernel = @cuda launch=false _add_kernel!(C, A, B, m, n)
config = launch_configuration(kernel.fun)
threads_i = min(size(A,1), config.threads)
threads_j = min(size(A,2), config.threads)
threads = (threads_i, threads_j)
blocks_i = cld(size(A,1), threads_i)
blocks_j = cld(size(A,2), threads_j)
blocks = (blocks_i, blocks_j)
CUDA.@sync kernel(C, A, B, m, n; threads=threads, blocks=blocks)

And it works well. However, when I increase the size of the matrices through the nA parameter, I get different errors. If nA = 30 I get

CUDA error: too many resources requested for launch (code 701, ERROR_LAUNCH_OUT_OF_RESOURCES)

If nA > 32 I get

CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)

I guess that the problem is related to the maximum dimension on the y dimension. How can I solve it? This is my device:

CuDevice(0): NVIDIA GeForce GTX 1650 Ti

It has 896 cores (≈29.93²). I guess this corresponds to the limit you are hitting.

How do you get this value? From the code below I saw that the maximum number of threads on both the x and y dims is 1024:

device = CUDA.device()
max_threads_x = CUDA.attribute(device, CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X)
max_threads_y = CUDA.attribute(device, CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Y)
max_threads_z = CUDA.attribute(device, CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Z)

println("Max threads in x dimension: ", max_threads_x)
println("Max threads in y dimension: ", max_threads_y)
println("Max threads in z dimension: ", max_threads_z)

max_threads_per_block = CUDA.attribute(device, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
println("Max threads per block: ", max_threads_per_block)
Max threads in x dimension: 1024
Max threads in y dimension: 1024
Max threads in z dimension: 64
Max threads per block: 1024

I can’t find the 896 value you reported.

Now, how can I make it automatic on different GPUs, to avoid problems like this?

I recommend the CUDA C++ Programming Guide, for which CUDA.jl is an interface; the underlying engine is described in the Nvidia docs.

Got it from googling. For example, this result:
Nvidia GTX 1650 Ti Specs Listed in Benchmark Results | Tom's Hardware.

I never went through this Guide very deeply, but I can’t find a section that discusses this situation.

By the way, I tried with nA = 29 and I get the same error. For some reason the max value of total threads is 768 (and so nA <= 27, since 27² = 729 ≤ 768 < 28² = 784).

I solved the problem by introducing this constraint:

kernel = @cuda launch=false _add_kernel!(C, A, B, m, n)
config = launch_configuration(kernel.fun)
dim_ratio = size(A,1) / size(A,2)
max_threads_i = floor(Int, sqrt(config.threads * dim_ratio))
max_threads_j = floor(Int, sqrt(config.threads / dim_ratio))

threads_i = min(size(A,1), max_threads_i)
threads_j = min(size(A,2), max_threads_j)
threads = (threads_i, threads_j)
blocks_i = cld(size(A,1), threads_i)
blocks_j = cld(size(A,2), threads_j)
blocks = (blocks_i, blocks_j)
CUDA.@sync kernel(C, A, B, m, n; threads=threads, blocks=blocks)

The maximum number of threads is bounded by device limits (on each axis of the block and grid configuration), as well as by the kernel itself. If your kernel uses many registers or shared memory, fewer threads can be in flight. You should use the occupancy API for the actual bound; look for launch_configuration in the CUDA.jl source code or here on Discourse (we should probably document this better).
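To make that concrete, here is a minimal sketch of a generic 2D launch helper built on the occupancy API. It assumes a kernel compiled with `@cuda launch=false` (as in the posts above) and uses a simple square split of the kernel-specific thread budget instead of the aspect-ratio split; `launch_2d!` is a hypothetical helper name, not part of CUDA.jl:

using CUDA

# Hedged sketch: choose a 2D block shape whose *total* thread count stays
# within the occupancy-derived limit for this specific compiled kernel.
function launch_2d!(kernel, C, A, B, m, n)
    config = launch_configuration(kernel.fun)  # kernel-specific thread bound
    t = floor(Int, sqrt(config.threads))       # square split of the budget
    threads = (min(m, t), min(n, t))
    blocks  = (cld(m, threads[1]), cld(n, threads[2]))
    CUDA.@sync kernel(C, A, B, m, n; threads=threads, blocks=blocks)
end

kernel = @cuda launch=false _add_kernel!(C, A, B, m, n)
launch_2d!(kernel, C, A, B, m, n)

Because `config.threads` already accounts for this kernel's register and shared-memory usage, the same code should pick valid launch bounds on any device, which is what makes it "automatic on different GPUs". You can also inspect `CUDA.registers(kernel)` to see why the limit is lower than the hardware's 1024.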
