CUDAnative: hitting a 1024 limit when the result comes back

I have a working kernel for a basic matrix multiplication based on the add example. It checks out fine for small two-dimensional matrices, but hits a problem when the output matrix is greater than 32x32.
Here is how I set up my matrices:

    (rowsa,colsa) = dimsa = (32,648)
    a = randn(Float32,dimsa)
    (rowsb,colsb) = dimsb = (648,32)
    b = randn(Float32,dimsb)
    c = zeros(Float32,(rowsa,colsb))

When matrix a is multiplied by matrix b the result should be 32x32, which it is, and the answer tests correct.

Note that I am using the zeros() function here; if I use similar() I get accumulation errors, since similar() does not initialize memory and I am using a += operator. The resulting matrix is 32x32, so I reserve that memory for c. The output should fit easily into the card's grid, which is 1024x1024.
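
For reference, here is a simplified sketch of what my kernel_mmul does (paraphrased, not my exact code; one thread computes one element of c):

    # One thread per output element; each thread accumulates its dot
    # product directly into c with +=, which is why c must start zeroed.
    function kernel_mmul(a, b, c, n)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
        for k in 1:n
            c[i, j] += a[i, k] * b[k, j]
        end
        return nothing
    end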

The problem arises when I push the output to 33x32 or 32x33: CUDA errors out. Evidently I have hit a 1024 limit somewhere. The output matrix c should fit easily if it were indeed a multidimensional array, but perhaps it has been linearized and overflows a block. Am I missing a key point here?

How? What’s the error code? Plain launch failure?

You don’t show much CUDAnative-related code here, but I’d guess that you are hitting the 1024-threads-per-block limit. From the deviceQuery output you posted here, it seems that your GPU has a (1024, 1024, 64) block dimension limitation. Maybe you are exceeding one of those dimensions when launching your kernel?
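
To stay under those limits you cap the block size and grow the grid instead; something like this (untested sketch, reusing the names from your post, with d_a/d_b/d_c being the uploaded device arrays):

    # cap each block at 32x32 = 1024 threads, and let the grid
    # cover whatever part of the output does not fit in one block
    threads = (32, 32)
    blocks  = (cld(rowsa, threads[1]), cld(colsb, threads[2]))
    @cuda (blocks, threads) kernel_mmul(d_a, d_b, d_c, colsa)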

Sorry about that; the error is:

    ERROR: LoadError: CUDAdrv.CuError(1, Nullable{String}())
    Stacktrace:
     [1] macro expansion at /home/colin/.julia/v0.6/CUDAdrv/src/base.jl:61 [inlined]
     [2] launch(::CUDAdrv.CuFunction, ::CUDAdrv.CuDim3, ::CUDAdrv.CuDim3, ::Int64, ::CUDAdrv.CuStream, ::Array{Any,1}) at /home/colin/.julia/v0.6/CUDAdrv/src/execution.jl:43
     [3] #cudacall#22(::Int64, ::CUDAdrv.CuStream, ::Function, ::CUDAdrv.CuFunction, ::Tuple{Int64,Int64}, ::Tuple{Int64,Int64}, ::Type{T} where T<:Tuple, ::CUDAdrv.DevicePtr{CUDAnative.CuDeviceArray{Float32,2}}, ::Vararg{Any,N} where N) at /home/colin/.julia/v0.6/CUDAdrv/src/execution.jl:68
     [4] macro expansion at /home/colin/.julia/v0.6/CUDAnative/src/execution.jl:141 [inlined]
     [5] #generated_cuda#58(::Array{Any,1}, ::CUDAnative.#generated_cuda, ::Tuple{Tuple{Int64,Int64},Tuple{Int64,Int64}}, ::#kernel_mmul, ::CUDAdrv.CuArray{Float32,2}, ::CUDAdrv.CuArray{Float32,2}, ::CUDAdrv.CuArray{Float32,2}, ::Int64) at /home/colin/.julia/v0.6/CUDAnative/src/execution.jl:200
     [6] main() at /home/colin/cuda/julia/testmm.jl:31
     [7] include_from_node1(::String) at ./loading.jl:539
     [8] include(::String) at ./sysimg.jl:14
     [9] process_options(::Base.JLOptions) at ./client.jl:300
     [10] _start() at ./client.jl:366
    while loading /home/colin/cuda/julia/testmm.jl, in expression starting on line 42

But error 1 seems to have multiple meanings, so I did not think it was helpful in this context.

However, I think you have identified my error in understanding: even though I might have a grid of 1024x1024 available, only 1024 threads are available at a time, so 1024x1 would be fine, as would 32x32, but that is it for one warp. I need to investigate the proper use of memory on the card when handling larger matrices.

This is fun, thank you for all the good work you are putting in.

Error 1 is CUDA_ERROR_INVALID_VALUE, which you can indeed get almost everywhere. It should have been rendered properly, though; I filed a bug for that.

Doing proper memory management (possibly dealing with larger-than-threadblock data) is the real complexity of programming GPUs: picking from the different communication primitives (shared|global memory + atomics|barriers, shuffle, voting, etc.) and rewriting your algorithm to use them. Lots of fun, but pretty hard 🙂
Check out the scan and reduce examples in the CUDAnative repo for some ideas.
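
To give a flavour, a tiled multiply using shared memory and barriers might look roughly like this (untested sketch; the 16x16 tile size is arbitrary, and for brevity it assumes the matrix dimensions are multiples of the tile size):

    const TILE = 16

    function kernel_mmul_tiled(a, b, c, n)
        # shared-memory staging buffers for one tile of a and of b
        ta = @cuStaticSharedMem(Float32, (TILE, TILE))
        tb = @cuStaticSharedMem(Float32, (TILE, TILE))
        tx = threadIdx().x
        ty = threadIdx().y
        row = (blockIdx().x - 1) * TILE + tx
        col = (blockIdx().y - 1) * TILE + ty
        acc = 0f0
        for t in 0:div(n, TILE) - 1
            # each thread loads one element of each tile
            ta[tx, ty] = a[row, t * TILE + ty]
            tb[tx, ty] = b[t * TILE + tx, col]
            sync_threads()               # wait until the whole tile is loaded
            for k in 1:TILE
                acc += ta[tx, k] * tb[k, ty]
            end
            sync_threads()               # wait before overwriting the tile
        end
        c[row, col] = acc
        return nothing
    end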


I take back the comment about warps and memory; in this case it was a matter of knowing what to do when the thread count is not an integer multiple of the block size. Since the iterations are handled transparently, it is not obvious what to do when block 1 is full size but block 2 needs to be stopped at some maximum count.

Take my case of using 32x32 blocks to stay within the 1024 limit. To handle a multiplication of 63x1 by 1x25, the 63 rows call for a block count of 2, but the last thread must not do anything. I have handled this by passing the count of 63 (imax below) to the kernel and then adding the code

    # bdx is the block size (blockDim().x) and imax the true element count,
    # both passed to the kernel as arguments
    i = Int32((blockIdx().x - 1) * bdx + threadIdx().x)
    if i > imax
        return nothing
    end
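
For completeness, the matching host-side launch math (a sketch with my numbers; variable names are just illustrative):

    imax    = Int32(63)           # true element count, passed to the kernel
    threads = 32                  # bdx in the kernel code above
    blocks  = cld(imax, threads)  # cld(63, 32) = 2; the last block is partly idle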

This seems to test out correctly, and from my timing experiments the if statement does not seem to impose a great burden. While learning, it pays to have a small, resource-limited card: you run into roadblocks that much sooner.