Error when implementing multidimensional kernel

Hello,

I was implementing a very simple example of kernel programming with CUDA.jl: the addition of two matrices. Here is the code I used:

using CUDA

function _add_kernel!(C, A, B, m, n)
    # 1-based global index of this thread along each dimension
    index_i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    index_j = (blockIdx().y - 1) * blockDim().y + threadIdx().y

    # total number of threads along each dimension of the grid
    stride_i = blockDim().x * gridDim().x
    stride_j = blockDim().y * gridDim().y

    # grid-stride loops, so the kernel also covers matrices larger than the grid
    for i in index_i:stride_i:m
        for j in index_j:stride_j:n
            @inbounds C[i,j] = A[i,j] + B[i,j]
        end
    end

    return nothing
end


# dimension of the matrices
nA = 20
A = CUDA.rand(Float32, nA, nA)
B = CUDA.rand(Float32, nA, nA);

TA = eltype(A)
TB = eltype(B)

m, n = size(A)

T = promote_type(TA, TB)
C = similar(A, T, m, n)

kernel = @cuda launch=false _add_kernel!(C, A, B, m, n)
config = launch_configuration(kernel.fun)
threads_i = min(size(A,1), config.threads)
threads_j = min(size(A,2), config.threads)
threads = (threads_i, threads_j)
blocks_i = cld(size(A,1), threads_i)
blocks_j = cld(size(A,2), threads_j)
blocks = (blocks_i, blocks_j)
CUDA.@sync kernel(C, A, B, m, n; threads=threads, blocks=blocks)
C

And it works well. However, when I increase the size of the matrices through the nA parameter, I get different errors. If nA = 30 I get

CUDA error: too many resources requested for launch (code 701, ERROR_LAUNCH_OUT_OF_RESOURCES)

If nA > 32 I get

CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)

I guess the problem is related to the maximum size along the y dimension. How can I solve it? This is my device:

CuDevice(0): NVIDIA GeForce GTX 1650 Ti

That GPU has 896 cores (≈29.93²); I guess this corresponds to the limit.

How do you get this value? From the code below I saw that the maximum number of threads on both the x and y dims is 1024:

device = CUDA.device()
max_threads_x = CUDA.attribute(device, CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X)
max_threads_y = CUDA.attribute(device, CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Y)
max_threads_z = CUDA.attribute(device, CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Z)

println("Max threads in x dimension: ", max_threads_x)
println("Max threads in y dimension: ", max_threads_y)
println("Max threads in z dimension: ", max_threads_z)

max_threads_per_block = CUDA.attribute(device, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
println("Max threads per block: ", max_threads_per_block)

This prints:

Max threads in x dimension: 1024
Max threads in y dimension: 1024
Max threads in z dimension: 64
Max threads per block: 1024

I can’t find the 896 value you reported.

Now, how can I make this automatic on different GPUs, to avoid problems like this?

I recommend the CUDA C++ Programming Guide; CUDA.jl is an interface to CUDA, and the underlying engine is described in the NVIDIA docs.

Got it from googling. For example, this result:
Nvidia GTX 1650 Ti Specs Listed in Benchmark Results | Tom's Hardware.

I never went through this Guide very deeply, but I can't find a section that discusses this situation.

By the way, I tried with nA = 29 and I got the same error. For some reason the max value of total threads is 768 (and so nA <= 27).
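
That bound checks out, since with threads = (nA, nA) the launch requests nA^2 threads per block (a quick check):

# nA^2 <= 768 requires nA <= floor(sqrt(768))
floor(Int, sqrt(768))  # 27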

I solved the problem by introducing this constraint:

kernel = @cuda launch=false _add_kernel!(C, A, B, m, n)
config = launch_configuration(kernel.fun)

# split the total thread budget proportionally to the aspect ratio of A,
# so that threads_i * threads_j <= config.threads
dim_ratio = size(A,1) / size(A,2)
max_threads_i = floor(Int, sqrt(config.threads * dim_ratio))
max_threads_j = floor(Int, sqrt(config.threads / dim_ratio))

threads_i = min(size(A,1), max_threads_i)
threads_j = min(size(A,2), max_threads_j)
threads = (threads_i, threads_j)
blocks_i = cld(size(A,1), threads_i)
blocks_j = cld(size(A,2), threads_j)
blocks = (blocks_i, blocks_j)
CUDA.@sync kernel(C, A, B, m, n; threads=threads, blocks=blocks)
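
For very skewed shapes the floor can give 0 along one dimension; here is a sketch of the same idea that guards against that and guarantees the product never exceeds the budget (split_threads_2d is just an illustrative helper, not CUDA.jl API):

# hypothetical helper (not part of CUDA.jl): split a thread budget over two
# dimensions, aiming for ti/tj ≈ m/n while keeping ti * tj <= total
function split_threads_2d(total, m, n)
    ti = clamp(floor(Int, sqrt(total * m / n)), 1, min(m, total))
    tj = clamp(fld(total, ti), 1, n)
    return (ti, tj)
end

threads = split_threads_2d(config.threads, size(A,1), size(A,2))
blocks = cld.(size(A), threads)
CUDA.@sync kernel(C, A, B, m, n; threads=threads, blocks=blocks)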

The maximum number of threads is bounded by device limits (on each axis of the block and grid configuration), as well as by the kernel itself: if your kernel uses many registers or a lot of shared memory, fewer threads can be in flight. You should use the occupancy API for the actual bound; look for launch_configuration in the CUDA.jl source code or here on Discourse (we should probably document this better).
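
For example (a sketch, reusing the kernel and arrays from above; if I recall correctly, CUDA.jl also exposes per-kernel introspection helpers such as CUDA.maxthreads and CUDA.registers):

kernel = @cuda launch=false _add_kernel!(C, A, B, m, n)

# occupancy-based suggestion for this specific compiled kernel
config = launch_configuration(kernel.fun)
@show config.threads config.blocks   # e.g. config.threads == 768 here

# per-kernel hard launch limit, and the register usage that lowers it
@show CUDA.maxthreads(kernel)
@show CUDA.registers(kernel)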
