I would really appreciate some help in understanding the threads and blocks options when using @cuda.
My understanding is that I should try to do:
```julia
maxPossibleThreads = attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X) # or maybe MAX_THREADS_PER_BLOCK?
threadsGPU = min(length(someArray), maxPossibleThreads)
blocksGPU = ceil(Int, length(someArray) / threadsGPU)
@cuda threads=threadsGPU blocks=blocksGPU funGPU!(someArray)

function funGPU!(someArray)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for k in index:stride:length(someArray)
        ...
```
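For reference, here's a self-contained version of what I'm running. The kernel body just adds 1 to each element as a stand-in for my real computation, and the array length of 1914 matches my actual problem; everything else is the same as above:

```julia
using CUDA

# Grid-stride loop: each thread starts at its global index and advances by the
# total number of threads in the grid, so any threads/blocks combination still
# covers the whole array.
function funGPU!(someArray)
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for k in index:stride:length(someArray)
        @inbounds someArray[k] += 1.0f0  # stand-in for the real work
    end
    return nothing
end

someArray = CUDA.zeros(Float32, 1914)

maxPossibleThreads = attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X)
threadsGPU = min(length(someArray), maxPossibleThreads)
blocksGPU  = ceil(Int, length(someArray) / threadsGPU)

@cuda threads=threadsGPU blocks=blocksGPU funGPU!(someArray)
```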
Here’s what I get from my GPU:
```julia
julia> attribute(device(),CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X)
1024

julia> attribute(device(),CUDA.DEVICE_ATTRIBUTE_MAX_GRID_DIM_X)
2147483647

julia> attribute(device(),CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
1024
```
This makes me think I can use 1024 threads with 1024 blocks, or maybe it’s 1024 threads and 1 block (I really don’t get the MAX_GRID_DIM_X value). But whenever I try threads=1024 I get

```
ERROR: CUDA error: too many resources requested for launch (code 701, ERROR_LAUNCH_OUT_OF_RESOURCES)
```

I get that error with a lot of other combinations too, actually (I clearly don’t know what I’m doing).
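For concreteness, with my array of length 1914 the formula above works out to 1024 threads and 2 blocks, which is exactly one of the combinations that fails:

```julia
julia> min(1914, 1024)        # threadsGPU
1024

julia> ceil(Int, 1914 / 1024) # blocksGPU
2
```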
My GPU has 2304 CUDA cores, so why don’t I see that number instead of 1024? I guess since this array is smaller than 2304 it can be computed in one step, but I’m trying to implement the code so it works well for much larger arrays as well.
In summary, what’s the optimal combination of threads and blocks, and how do I determine the maximum I can use on my system? Right now, for a problem with an array length of 1914, 16 threads and 120 blocks performs about the same as 8x240; 32x60 up to 256x8 are also similar (a bit faster than 16x120); and 512x4 and up doesn’t run at all (same error as above).
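One thing I did come across is the launch_configuration approach from the CUDA.jl introduction tutorial, which seems to suggest threads/blocks based on the compiled kernel rather than the raw device limits. Is something like this the recommended way to pick these numbers? My attempt is below (I may well be misusing it):

```julia
kernel = @cuda launch=false funGPU!(someArray)  # compile the kernel without launching it
config = launch_configuration(kernel.fun)       # occupancy-based suggestion for this kernel

threadsGPU = min(length(someArray), config.threads)
blocksGPU  = cld(length(someArray), threadsGPU)

kernel(someArray; threads=threadsGPU, blocks=blocksGPU)
```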
Thanks a lot!