Hello,

I was implementing a very simple example of kernel programming with CUDA.jl. As an example, I took the addition of two matrices. The code I used is the following:

```julia
using CUDA

function _add_kernel!(C, A, B, m, n)
    index_i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    index_j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    stride_i = blockDim().x * gridDim().x
    stride_j = blockDim().y * gridDim().y
    for i in index_i:stride_i:m
        for j in index_j:stride_j:n
            @inbounds C[i, j] = A[i, j] + B[i, j]
        end
    end
    return nothing
end

# dimension of the matrices
nA = 20
A = CUDA.rand(Float32, nA, nA)
B = CUDA.rand(Float32, nA, nA)

TA = eltype(A)
TB = eltype(B)
m, n = size(A)
T = promote_type(TA, TB)
C = similar(A, T, m, n)

kernel = @cuda launch=false _add_kernel!(C, A, B, m, n)
config = launch_configuration(kernel.fun)
threads_i = min(size(A, 1), config.threads)
threads_j = min(size(A, 2), config.threads)
threads = (threads_i, threads_j)
blocks_i = cld(size(A, 1), threads_i)
blocks_j = cld(size(A, 2), threads_j)
blocks = (blocks_i, blocks_j)
CUDA.@sync kernel(C, A, B, m, n; threads=threads, blocks=blocks)
C
```

And it works well. However, when I increase the size of the matrices through the `nA` parameter, I get different errors. If `nA = 30`, I get:

```
CUDA error: too many resources requested for launch (code 701, ERROR_LAUNCH_OUT_OF_RESOURCES)
```

If `nA > 32`, I get:

```
CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
```

I guess that the problem is related to the maximum number of threads along the y dimension of a block. How can I solve it? This is my device:

```
CuDevice(0): NVIDIA GeForce GTX 1650 Ti
```