Trying to understand 3D indexing

I prepared a kernel with thread dimensions (4, 4, 1) that iterates over an array of size 16x16x2, launched with 32 blocks. By my calculations everything should work, since all dimensions divide without remainder, but I constantly get ERROR: Out-of-bounds array access.

using CUDA

goldBoolGPU = CuArray(falses(16, 16, 2));
segmBoolGPU = CuArray(falses(16, 16, 2));
fn = CuArray([0])

function kernelFunct(goldBoolGPU::CuDeviceArray{Bool, 3, 1}, segmBoolGPU::CuDeviceArray{Bool, 3, 1}, fn)
    i = (blockIdx().x) * blockDim().x + threadIdx().x
    j = (blockIdx().y) * blockDim().y + threadIdx().y
    z = (blockIdx().z) * blockDim().z + threadIdx().z
    if (goldBoolGPU[i,j,z] & !segmBoolGPU[i,j,z])
        @atomic fn[] += 1
    end
    return
end

@cuda threads=(4, 4, 1) blocks=32 kernelFunct(goldBoolGPU, segmBoolGPU, fn)


ERROR: Out-of-bounds array access.

Thanks for the help.

How did you try debugging this? If I just add a single print statement (@cuprintln "goldBoolGPU[$i,$j,$z]") it’s obvious that you’re actually indexing your arrays out of bounds. You probably want to do (blockIdx().z - 1) * blockDim().z, and likewise for x and y, since those values are 1-indexed. In general you also may not be able to divide your iteration space perfectly across blocks, so you’d want a branch that checks whether the indices are in bounds.
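A minimal sketch of what those two suggestions could look like together, assuming a CUDA-capable GPU. The kernel name `count_fn_kernel`, the all-true test data, and the `blocks=(4, 4, 2)` launch are illustrative choices, not taken from the question:

```julia
using CUDA

# Sketch of the kernel from the question with 1-based index arithmetic and a bounds check.
function count_fn_kernel(gold, segm, fn)
    # blockIdx() and threadIdx() are 1-based, so subtract 1 from the block index.
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    z = (blockIdx().z - 1) * blockDim().z + threadIdx().z

    # Skip threads that fall outside the array.
    if i <= size(gold, 1) && j <= size(gold, 2) && z <= size(gold, 3)
        if gold[i, j, z] & !segm[i, j, z]
            CUDA.@atomic fn[1] += 1
        end
    end
    return nothing
end

# Illustrative data: every voxel is set in `gold` and none in `segm`.
gold = CuArray(fill(true, 16, 16, 2))
segm = CuArray(fill(false, 16, 16, 2))
fn = CuArray([0])

# 4x4x2 blocks of 4x4x1 threads cover exactly 16x16x2 elements.
@cuda threads=(4, 4, 1) blocks=(4, 4, 2) count_fn_kernel(gold, segm, fn)
println(Array(fn)[1])
```

With the bounds check in place, launching more blocks than needed only wastes threads instead of raising an error.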

Thank you @maleadt for the response, and for taking the time to answer my question again.

I know that I am going out of bounds, as stated in the question, but I just do not understand why. As far as I could debug it, I consistently got out of bounds for index i but not for j and z.

I will try adding this minus 1 to the indexing as you suggested, thanks!

I did not include bounds checking, as this code will work on medical images where slices are squares with an edge length of 256, 512 or 1024. So in theory, when I iterate with a 16 by 16 square over a square whose edge length is divisible by 16, it should be possible to go through all entries without any remainder, if I understand it correctly.
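To make the divisibility argument concrete, here is a small CPU-only sketch of deriving the block count from the array size rather than hard-coding it. The helper name `launch_blocks` and the 250-wide example are hypothetical, chosen only for illustration:

```julia
# Hypothetical helper: blocks needed per dimension so that blocks .* threads covers dims.
launch_blocks(dims, threads) = cld.(dims, threads)

threads = (16, 16, 1)

# Edge length divisible by 16: the grid covers the slice exactly, with no spare threads.
exact = launch_blocks((256, 256, 2), threads)

# Edge length not divisible by 16: the last block in x overshoots the array,
# which is the case where a bounds check inside the kernel becomes necessary.
padded = launch_blocks((250, 256, 2), threads)

println(exact, " ", padded)
```

The resulting tuple can be passed as `blocks=...` to `@cuda`, so the launch always follows from the array dimensions.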

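You’re creating a 16x16x2 array, but launching 32 blocks of 4x4x1 threads each, while (incorrectly) computing x indices that go up to 132 (4x32+4). For y and z the grid dimension is 1, so those indices are still off, but only by one block: j runs from 5 to 8 and z is always 2, which happens to stay inside this array. Note that even with the - 1 correction, 32 blocks of 4 threads along x reach index 128, so the launch itself should match the array, for example blocks=(4, 4, 2).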

I recommend adding some print statements to understand how these values are affected by the launch configuration.
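The same arithmetic can also be checked on the CPU before touching the GPU. The following is only a sketch: `index_range` is a hypothetical helper that enumerates every index one dimension of a launch would produce, with `offset` set to 0 for the formula in the question and 1 for the corrected one:

```julia
# CPU-only sketch: smallest and largest index produced along one dimension.
# Block and thread indices are 1-based, mirroring blockIdx() and threadIdx().
function index_range(nblocks, nthreads, offset)
    idx = [(b - offset) * nthreads + t for b in 1:nblocks for t in 1:nthreads]
    return extrema(idx)
end

# Launch from the question: threads=(4, 4, 1), blocks=32, i.e. a grid of (32, 1, 1).
i_range = index_range(32, 4, 0)   # x dimension
j_range = index_range(1, 4, 0)    # y dimension
z_range = index_range(1, 1, 0)    # z dimension

# Corrected formula, same launch: x still runs far past 16.
i_fixed_same_launch = index_range(32, 4, 1)

# Corrected formula with a grid of (4, 4, 2) blocks.
i_fixed = index_range(4, 4, 1)
z_fixed = index_range(2, 1, 1)

println((i_range, j_range, z_range, i_fixed_same_launch, i_fixed, z_fixed))
```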