@cuda threads and blocks confusion

I would really appreciate some help in understanding the threads and blocks options when using @cuda.

I read this: https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/, watched "GPU Programming in Julia" on YouTube, and I'm still a bit confused.

My understanding is that I should try to do:

maxPossibleThreads = attribute(device(),CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X) # or maybe MAX_THREADS_PER_BLOCK?
threadsGPU = min(length(someArray), maxPossibleThreads)
blocksGPU = ceil(Int, length(someArray)/threadsGPU)
@cuda threads=threadsGPU blocks=blocksGPU funGPU!(someArray)

function funGPU!(someArray)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for k in index:stride:length(someArray)
        # ... per-element work on someArray[k] ...
    end
    return nothing
end
Here’s what I get from my GPU

julia> attribute(device(),CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X)
1024
julia> attribute(device(),CUDA.DEVICE_ATTRIBUTE_MAX_GRID_DIM_X)

This makes me think that I can use 1024 in threads with 1024 blocks or maybe it’s 1024 threads and 1 block (I really don’t get the MAX_GRID_DIM_X). But I always get ERROR: CUDA error: too many resources requested for launch (code 701, ERROR_LAUNCH_OUT_OF_RESOURCES) when I try threads=1024. I get that error with a lot of combinations, actually (I clearly don’t know what I’m doing).
My GPU has 2304 CUDA cores, so why don’t I see that number instead of 1024? I guess since this array is smaller than 2304 it can be computed in one step, but I’m trying to implement the code so it works well for much larger arrays as well.

In summary, what's the optimal combination of threads and blocks, and how do I determine the maximum I can use on my system? Right now, for a problem with an array length of 1914, 16 threads and 120 blocks works; 8x240, 32x60, and up to 256x8 are all similar (a bit faster than 16x120), and 512x4 and up doesn't run at all (same error as above).
Thanks a lot!

Max threads is further restricted by the resource usage of your kernel; that's why you're better off using the occupancy API, which takes that into account.

@maleadt Thanks. So I should use this:

function configurator(kernel)
    config = launch_configuration(kernel.fun)
    threads = Base.min(arrayLength, config.threads)
    blocks = cld(arrayLength, threads)
    return (threads=threads, blocks=blocks)
end

@cuda config=configurator funGPU!(someArray)

And continue to do indexing as above? That’s the best way?
How does that extend to a 2D matrix? Do I simply use the same index system (adding a second index with .y)? But then how do I change my configurator, and how do my threads/blocks scale with 2 dimensions? Square root?
Thanks a lot!

The API has recently been simplified: https://github.com/JuliaGPU/CUDA.jl/blob/4eb99b9f53acfc02a01f92d4a0a2b219bf8994cc/src/indexing.jl#L32-L36

But yes, it’s best to use the occupancy API. You need to extend this yourself to multiple dimensions, the occupancy API only works with 1D threads/blocks. Alternatively, just convert the linear thread index to an appropriate 2D one in your kernel.
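Something like this should work (an untested sketch; `funGPU2D!` is just a name I made up here, and the doubling is placeholder work):

```julia
using CUDA

# Grid-stride loop with a 1D launch configuration; the linear index k
# is converted to a 2D index via CartesianIndices, which is GPU-compatible.
function funGPU2D!(A)
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    cart = CartesianIndices(A)      # linear index -> CartesianIndex(i, j)
    for k in index:stride:length(A)
        i, j = Tuple(cart[k])
        A[i, j] = 2 * A[i, j]       # placeholder per-element work
    end
    return nothing
end
```

That way the 1D configuration from the occupancy API still applies unchanged.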

@maleadt Sorry for being a noob, but can you share an example of how to use the new API?
Extending the index to 2D would work for me. No drawbacks in doing that? Can you help with the best way to do it? I’m guessing

threads = Base.min(arrayLength^2, config.threads);
blocks = cld(arrayLength^2, threads);

And then loop through the entire matrix (using a single index instead of [i,j]). But I just noticed there’s no limit on blocksGPU. Won’t that cause the too many resources error at some point?
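For reference, here's how I'm imagining the limits fit together, capping blocks by MAX_GRID_DIM_X as well (just my own sketch; `someMatrix` stands in for my 404x404 array):

```julia
using CUDA

arrayLength = 404                      # my current problem size
n = arrayLength^2                      # total elements of the matrix
maxThreads = attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X)
maxBlocks  = attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_GRID_DIM_X)
threads = min(n, maxThreads)
blocks  = min(cld(n, threads), maxBlocks)  # grid-stride loop covers any leftover
@cuda threads=threads blocks=blocks funGPU!(someMatrix)
```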
Thanks a lot! Really appreciate the help!

I actually just added some logic very much like that to KernelAbstractions.jl, so you might want to take a look at the threads_to_workgroupsize function which will convert a number of threads to the optimal strides in each dimension: https://github.com/JuliaGPU/KernelAbstractions.jl/pull/206

In fact, it looks like you might even want to consider using KernelAbstractions instead of CUDA.jl directly, which will take care of a lot of that partitioning for you.



@maleadt Ok, looking at the code I think I understand how to use it. I tested it in a small (length(someArray) = 404) case:

kernel = @cuda launch=false funGPU!(someArray)
config = launch_configuration(kernel.fun) # for a certain test this resulted in b=36, t=256
threads = Base.min(length(someArray), config.threads) # still 256
blocks =  cld(length(someArray), threads) # now changed to 2
@btime CUDA.@sync kernel(someArray; threads=256, blocks=2) # 27 msec
@btime CUDA.@sync @cuda threads=256 blocks=2 funGPU!(someArray) # 27 msec
# but I messed around with other values
@btime CUDA.@sync kernel(someArray; threads=16, blocks=32) # 8 msec

I’m probably still messing up, since the way I thought I was supposed to be doing things is clearly suboptimal. Thoughts?

@simeonschaub That's quite interesting. I like that it forces trig functions to use their `CUDA.` variants. From trying to read the source code, I think it uses the same math as above to calculate threads and blocks, though, right? So the values would still be suboptimal for this case?

It might be relevant that my function takes many more arguments, of length 404 and size 404x404 (funGPU!(A, B, C, D, ...) where A is 404x404, B is 404x404, ..., E is 404, F is 404, ...). I tried 1D and 2D parallelization; both ended up at exactly 27 and 8 msec using the steps described above.

The occupancy API gives you the maximal launch configuration that still achieves 100% occupancy. But depending on your kernel, it might still be interesting to deviate from that, i.e., use fewer threads. Generally though, the API should give you a good and safe upper bound.
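For example, something like this (a sketch) to compare the occupancy-derived upper bound against a few smaller thread counts:

```julia
using CUDA, BenchmarkTools

kernel = @cuda launch=false funGPU!(someArray)
config = launch_configuration(kernel.fun)
upper  = min(length(someArray), config.threads)  # the "safe" upper bound
for t in (32, 64, 128, 256, upper)
    b = cld(length(someArray), t)  # the grid-stride loop keeps this correct
    @btime CUDA.@sync $kernel($someArray; threads=$t, blocks=$b)
end
```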

@maleadt Ok, then I’ll keep using that.
Thanks a lot!