CUDA | nested loops kernel

Just to clarify: you could launch a 3D grid, simply by replacing

by something like

threads = (8, 8, 8)  # probably not optimal
blocks = cld.(size(z_cu), threads)
run_CPU!
  25.851 ms (42 allocations: 5.00 KiB)

run_array_GPU!
  787.000 μs (119 allocations: 3.72 KiB)

Run GPU kernel: test_reshaped_1D_GPU!
  776.200 μs (32 allocations: 576 bytes)

Run GPU kernel: test_CartesianIndices_3D_GPU!
  777.400 μs (33 allocations: 592 bytes)

Run GPU kernel: test_nested_loops_GPU!
  774.100 μs (32 allocations: 576 bytes)

There’s just not really any benefit to a 3D grid over linear indexing here, since the kernel has no 3D structure to exploit. If you were implementing, say, a 2D image convolution, then a 2D grid would actually make sense (though a 1D grid in combination with CartesianIndices would of course also work).
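To make the comparison concrete, here is a minimal sketch of the two launch styles side by side. The kernel names `fill_3d!` and `fill_1d!`, the array size, and the `i + j + k` body are all hypothetical placeholders rather than the code from this thread; only `threads`, `blocks` and `z_cu` follow the snippet above. Note that `cld` rounds up, so the grid can overshoot the array and each kernel needs a bounds check.

```julia
using CUDA

# 3D grid: each thread gets its (i, j, k) directly from the launch geometry.
function fill_3d!(z)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    k = (blockIdx().z - 1) * blockDim().z + threadIdx().z
    # cld rounds the block count up, so some threads fall outside the array.
    if i <= size(z, 1) && j <= size(z, 2) && k <= size(z, 3)
        @inbounds z[i, j, k] = i + j + k
    end
    return nothing
end

# 1D grid: each thread gets a linear index and recovers (i, j, k) from it.
function fill_1d!(z)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if idx <= length(z)
        @inbounds I = CartesianIndices(z)[idx]
        i, j, k = Tuple(I)
        @inbounds z[i, j, k] = i + j + k
    end
    return nothing
end

# Hypothetical size, deliberately not a multiple of the block size.
dims = (100, 60, 30)
expected = Float32[i + j + k for i in 1:dims[1], j in 1:dims[2], k in 1:dims[3]]

if CUDA.functional()
    # 3D launch, as in the snippet above.
    z_cu = CUDA.zeros(Float32, dims)
    threads = (8, 8, 8)  # probably not optimal
    blocks = cld.(size(z_cu), threads)
    @cuda threads=threads blocks=blocks fill_3d!(z_cu)
    z_3d = Array(z_cu)

    # 1D launch over the same data.
    z_cu = CUDA.zeros(Float32, dims)
    threads_1d = 256
    blocks_1d = cld(length(z_cu), threads_1d)
    @cuda threads=threads_1d blocks=blocks_1d fill_1d!(z_cu)
    z_1d = Array(z_cu)
end
```

Both kernels do the same per-element work, so similar timings are what you would expect; the 3D version only pays off once neighbouring threads in a block can actually share something.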