I prepared a kernel with thread dimensions (4, 4, 1) that iterates over an array of size (16, 16, 2), launched with 32 blocks. By my calculations everything should work, since every dimension divides without remainder, but I constantly get ERROR: Out-of-bounds array access.

```
using CUDA
goldBoolGPU= CuArray(falses(16,16,2));
segmBoolGPU= CuArray(falses(16,16,2));
fn = CuArray([0])
function kernelFunct(goldBoolGPU::CuDeviceArray{Bool, 3, 1}, segmBoolGPU::CuDeviceArray{Bool, 3, 1}, fn)
    i = (blockIdx().x) * blockDim().x + threadIdx().x
    j = (blockIdx().y) * blockDim().y + threadIdx().y
    z = (blockIdx().z) * blockDim().z + threadIdx().z
    if (goldBoolGPU[i,j,z] & !segmBoolGPU[i,j,z])
        @atomic fn[] += 1
    end
    return
end
@cuda threads=(4, 4, 1) blocks=32 kernelFunct(goldBoolGPU, segmBoolGPU, fn)
```

The error I get:

ERROR: Out-of-bounds array access.
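
To double-check the arithmetic, the index formula from the kernel can be replayed on the CPU without a GPU. This is only a sketch under two assumptions: that `blockIdx()` and `threadIdx()` are 1-based in CUDA.jl, and that `blocks=32` is treated as a 1-D grid along x. The names `threads`, `blocks`, `dims`, `grid3d`, `formula_min`, `formula_max`, and `shifted_max` are made up for illustration and are not part of the original code.

```julia
# CPU-only replay of the kernel's index formula; no GPU required.
# Assumption: blockIdx() and threadIdx() are 1-based, as in CUDA.jl.
threads = (4, 4, 1)     # threads per block, from the launch above
blocks  = (32, 1, 1)    # assumption: blocks=32 expands to a 1-D grid along x
dims    = (16, 16, 2)   # size of goldBoolGPU and segmBoolGPU

# Smallest and largest index that `blockIdx * blockDim + threadIdx` can produce per axis.
formula_min = ntuple(d -> 1 * threads[d] + 1, 3)
formula_max = ntuple(d -> blocks[d] * threads[d] + threads[d], 3)

println("index range per axis: ", formula_min, " to ", formula_max)
println("fits in array: ", all(formula_max .<= dims))

# For comparison (hypothetical alternative, not the original launch):
# a 3-D grid that tiles the array, with the block index shifted by one.
grid3d = (4, 4, 2)
shifted_max = ntuple(d -> (grid3d[d] - 1) * threads[d] + threads[d], 3)
println("largest index with shifted formula: ", shifted_max)
```

If the two assumptions hold, the first check shows that the x index runs well past 16 and that indices below 5 are never visited, while the shifted variant tops out exactly at the array size.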

Thanks for the help.