Hello,
I am working on my understanding of blocks and threads per block (what CUDA calls the block dimension, blockDim).
So I wrote a simple kernel (called gpu_heavy! to impress my friends) that writes a few values (blockIdx().x, blockDim().x, and so on) back to 4 CuArrays.
Below is the boilerplate for launching it:
myKernel = @cuda name="I_AM_SOME_KERNEL" launch=false gpu_heavy!(d_thisBlockIdx, d_thisBlockDimx, d_thisThreadIdx, d_thisGridDimx)
config = launch_configuration(myKernel.fun)
threads = Base.min(length(d_thisBlockIdx), config.threads)
blocks = cld(length(d_thisBlockIdx), threads)
println("According to launch_configuration optimal threads=", threads, " and optimal blocks is: ", blocks)
# And launch
CUDA.@time myKernel(d_thisBlockIdx, d_thisBlockDimx, d_thisThreadIdx, d_thisGridDimx; threads=threads, blocks=blocks)
I understand the above code uses the occupancy API, which returns reasonable values for the number of threads and blocks.
Is that correct?
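To make the arithmetic in the launch boilerplate concrete, here is a small CPU-only sketch. The numbers are made up for illustration: n stands in for length(d_thisBlockIdx) and suggested_threads stands in for config.threads; neither comes from a real device.

```julia
# Hypothetical numbers, chosen only to illustrate the arithmetic.
# n stands in for length(d_thisBlockIdx), suggested_threads for config.threads.
n = 10_000
suggested_threads = 1024

threads = min(n, suggested_threads)   # never more threads per block than elements
blocks = cld(n, threads)              # ceiling division, so every element is covered

println("threads = ", threads, ", blocks = ", blocks)
```

With these assumed values the grid has at least as many threads as elements, and the last block is only partly used.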
The kernel looks like this:
function gpu_heavy!(d_thisBlockIdx, d_thisBlockDimx, d_thisThreadIdx, d_thisGridDimx)
    # Remember this kernel has no direct idea which portion of the workload it is processing,
    # so we need a way to index it to a unique part of the passed arrays (in this case they are all equally long)
    thisBlockIdx = blockIdx().x
    thisBlockDimx = blockDim().x # Block dimension: I prefer to call it threadsPerBlock. There can also be a y and z
    thisThreadIdx = threadIdx().x
    thisGridDimx = gridDim().x
    # Global 1-based index of this thread, and the total number of threads in the grid
    index = (thisBlockIdx - 1) * thisBlockDimx + thisThreadIdx
    stride = thisGridDimx * thisBlockDimx
    # Grid-stride loop: each thread handles elements index, index + stride, ...
    for i = index:stride:length(d_thisBlockIdx)
        @inbounds d_thisBlockIdx[i] = thisBlockIdx
        @inbounds d_thisBlockDimx[i] = thisBlockDimx
        @inbounds d_thisThreadIdx[i] = thisThreadIdx
        @inbounds d_thisGridDimx[i] = thisGridDimx
    end
    return nothing
end
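The index and stride formulas in the kernel above can be checked without a GPU. The following is a minimal CPU-only sketch that replays the same arithmetic for every (block, thread) pair and counts how often each element is written. simulate_grid_stride is a hypothetical helper made up for this illustration, not part of CUDA.jl, and the sizes are arbitrary.

```julia
# CPU-only sketch: replay the kernel's 1D index arithmetic for every (block, thread) pair.
# simulate_grid_stride is a hypothetical helper, not part of CUDA.jl.
function simulate_grid_stride(n, threads, blocks)
    visits = zeros(Int, n)                  # how many times each element gets written
    for b in 1:blocks, t in 1:threads
        index = (b - 1) * threads + t       # same formula as in the kernel (1-based)
        stride = blocks * threads           # total number of threads in the grid
        for i in index:stride:n
            visits[i] += 1
        end
    end
    return visits
end

visits = simulate_grid_stride(10, 4, 2)     # 10 elements, 2 blocks of 4 threads
println(visits)
```

If the formulas are right, every entry of visits is 1, even when the grid has fewer threads than elements (as in this call, where 8 threads cover 10 elements).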
I understand that both the threads and the blocks can also be 2- or 3-dimensional (the .y and .z).
But the above code ignores that fact completely.
Since I am not working on 2 or 3 Dimensions, that is fine for me.
My question is this:
My call to
CUDA.@time myKernel(d_thisBlockIdx, d_thisBlockDimx, d_thisThreadIdx, d_thisGridDimx; threads=threads, blocks=blocks)
launches the work on the GPU.
But is it possible that config = launch_configuration(myKernel.fun) also returns y and z dimensions
(which I am clearly ignoring)?
Thanks for your time!
Erwin