Just to clarify: you could launch a 3D grid simply by replacing the existing thread and block configuration with something like
threads = (8, 8, 8) # probably not optimal
blocks = cld.(size(z_cu), threads)
run_CPU!
25.851 ms (42 allocations: 5.00 KiB)
run_array_GPU!
787.000 μs (119 allocations: 3.72 KiB)
Run GPU kernel: test_reshaped_1D_GPU!
776.200 μs (32 allocations: 576 bytes)
Run GPU kernel: test_CartesianIndices_3D_GPU!
777.400 μs (33 allocations: 592 bytes)
Run GPU kernel: test_nested_loops_GPU!
774.100 μs (32 allocations: 576 bytes)
There’s just not really any benefit to it over linear indexing here, since there’s no 3D structure to exploit. If you were to implement, say, a 2D image convolution, then a 2D grid would actually make sense (though a 1D grid in combination with CartesianIndices would of course also work).
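To make the comparison concrete, here is a minimal CPU-only sketch of the index arithmetic behind both launch styles. The array size `dims`, the 1D block size of 256, and the helper `visit!` are assumptions made up for illustration, not taken from the code discussed above. In a real CUDA.jl kernel, `idx` would be the thread's global linear index, `(blockIdx().x - 1) * blockDim().x + threadIdx().x`; here a plain loop stands in for the GPU threads.

```julia
# Hypothetical array size, chosen only for illustration.
dims = (100, 50, 30)

# 3D launch configuration: one ceiling division per dimension.
threads3d = (8, 8, 8)                  # 512 threads per block
blocks3d = cld.(dims, threads3d)       # blocks needed along each axis

# 1D launch configuration covering the same number of elements.
threads1d = 256                        # assumed block size
blocks1d = cld(prod(dims), threads1d)

# What a single thread would do in the 1D variant: map its linear
# index back to a 3D index through CartesianIndices.
function visit!(z, idx)
    cart = CartesianIndices(z)
    if idx <= length(cart)             # guard the threads past the end
        i, j, k = Tuple(cart[idx])
        z[i, j, k] += 1
    end
    return nothing
end

# Emulate every launched thread on the CPU.
z = zeros(Int, dims)
for idx in 1:(threads1d * blocks1d)
    visit!(z, idx)
end

println(blocks3d)
println(blocks1d)
println(all(==(1), z))                 # each element visited exactly once
```

Both configurations cover every element; the 1D version simply recovers `(i, j, k)` from the linear index, which is why the grid shape by itself makes no difference when the work per element does not depend on its neighbours.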