I prepared a kernel with thread dimensions (4, 4, 1) that iterates over an array of size 16x16x2 using 32 blocks. As far as my calculations go, everything should work, since all dimensions divide without remainder, but I constantly get ERROR: Out-of-bounds array access
using CUDA
goldBoolGPU= CuArray(falses(16,16,2));
segmBoolGPU= CuArray(falses(16,16,2));
fn = CuArray([0])
function kernelFunct(goldBoolGPU::CuDeviceArray{Bool, 3, 1}, segmBoolGPU::CuDeviceArray{Bool, 3, 1}, fn)
    i = (blockIdx().x) * blockDim().x + threadIdx().x
    j = (blockIdx().y) * blockDim().y + threadIdx().y
    z = (blockIdx().z) * blockDim().z + threadIdx().z
    if (goldBoolGPU[i,j,z] & !segmBoolGPU[i,j,z])
        @atomic fn[] += 1
    end
    return
end
@cuda threads=(4, 4,1) blocks=32 kernelFunct(goldBoolGPU,segmBoolGPU,fn)
How did you try debugging this? If I add a single print statement (@cuprintln("goldBoolGPU[$i,$j,$z]")), it’s obvious that you’re actually indexing your arrays out of bounds. You probably want (blockIdx().z - 1) * blockDim().z, and likewise for x and y, since those values are 1-indexed. In general you also may not be able to divide your iteration space perfectly across blocks, so you’d want a branch that checks whether the indices are in bounds.
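To make both suggestions concrete, here is a minimal sketch of the kernel with the `- 1` offset and a bounds guard. The name `count_kernel`, the all-true/all-false test data, and the 3-D grid `blocks=(4, 4, 2)` are assumptions made for illustration and are not taken from the original post; the sketch needs a CUDA-capable GPU to run.

```julia
using CUDA

# Hypothetical kernel name; same logic as in the question, with 1-based
# block offsets and a guard against out-of-bounds indices.
function count_kernel(gold, segm, fn)
    # blockIdx() and threadIdx() are 1-based in CUDA.jl, hence the "- 1"
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    z = (blockIdx().z - 1) * blockDim().z + threadIdx().z
    # Threads that fall outside the array do nothing
    if i <= size(gold, 1) && j <= size(gold, 2) && z <= size(gold, 3)
        if gold[i, j, z] & !segm[i, j, z]
            CUDA.@atomic fn[1] += 1
        end
    end
    return nothing
end

# Illustrative data (an assumption): every voxel should be counted
gold = CUDA.ones(Bool, 16, 16, 2)
segm = CUDA.zeros(Bool, 16, 16, 2)
fn = CuArray([0])

# 4*4*2 = 32 blocks, laid out in 3-D so that each array axis is covered
@cuda threads=(4, 4, 1) blocks=(4, 4, 2) count_kernel(gold, segm, fn)

total = Array(fn)[1]
println(total)  # 512 = 16*16*2 when every element is visited exactly once
```

The total number of blocks is still 32, as in the question. What changes is that they are spread over x, y and z instead of all being placed along x.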
Thank you @maleadt for the response, and for taking the time to answer my question again.
I know that I am going out of bounds, as stated in the question, but I just do not understand why. As far as I could debug it, I consistently got out of bounds for index i, but not for j and z.
I will try adding the minus 1 to the indexing as you suggested, thanks!
I did not include bounds checking because this code will work on medical images whose slices are squares with an edge length of 256, 512 or 1024. So in theory, when I iterate with a 16 by 16 square over a square whose edge length is divisible by 16, it should be possible to go through all entries without any remainder, if I understand it correctly.
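The divisibility argument can be checked per dimension with plain Julia, no GPU required. This is only a sketch: the helper name `grid_for` and the slice count of 2 are assumptions for illustration.

```julia
# Hypothetical helper: number of blocks needed along each axis
grid_for(dims, threads) = cld.(dims, threads)

threads = (16, 16, 1)
for edge in (256, 512, 1024)
    dims = (edge, edge, 2)            # slice count of 2 is an assumption
    blocks = grid_for(dims, threads)
    covered = blocks .* threads       # threads launched along each axis
    println(dims, " -> blocks=", blocks, ", covered=", covered)
end
```

When every edge is divisible by the block size, `covered` equals `dims` and no thread falls outside the array, but only if the grid is passed as a tuple per axis rather than as a single block count.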
You’re creating a 16x16x2 array, but launching 32 blocks of 4x4x1 threads each, while (incorrectly) computing indices that go up to 132 (4x32+4) in x. For y and z the grid dimension is 1, so those indices are shifted by one block as well (j runs from 5 to 8 and z is 2), but they stay within bounds.
I recommend adding some print statements to understand how these values are affected by the launch configuration.
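As a GPU-free alternative to print statements, the index ranges can be reproduced on the CPU. This is a rough sketch: `index_range` and its `offset` keyword are hypothetical names, and the grid `(32, 1, 1)` reflects the fact that `blocks=32` launches a one-dimensional grid.

```julia
# Smallest and largest index produced along one axis.
# offset = 0 reproduces the kernel from the question, offset = 1 the "- 1" fix.
function index_range(nblocks, nthreads; offset)
    idx = [(b - offset) * nthreads + t for b in 1:nblocks, t in 1:nthreads]
    return extrema(idx)
end

threads = (4, 4, 1)
blocks = (32, 1, 1)   # `blocks=32` means 32 blocks along x, 1 along y and z

original = [index_range(blocks[d], threads[d]; offset=0) for d in 1:3]
shifted = [index_range(blocks[d], threads[d]; offset=1) for d in 1:3]

println(original)  # [(5, 132), (5, 8), (2, 2)]
println(shifted)   # [(1, 128), (1, 4), (1, 1)]
```

Even with the `- 1` applied, i still reaches 128 on an axis of length 16, while j only covers 1 to 4 and z only 1. Spreading the same 32 blocks as `blocks=(4, 4, 2)` would cover the 16x16x2 array exactly.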