Suppose I have a multidimensional array and want to perform some operation along one of its axes. Generic sample code could look like this:

```
s = (1, 2, 3, 4, 5)   # shape of the array
dim = 3               # dimension along which I want to perform the operation
A = zeros(s)
spre = s[1:dim-1]
sdim = s[dim]
spost = s[dim+1:end]
Cpre = CartesianIndices(spre)
Cpost = CartesianIndices(spost)
for Ipost in Cpost
    for Ipre in Cpre
        for i = 1:sdim
            A[Ipre, i, Ipost] = i
        end
    end
end
```

Next, I would like to launch these `for` loops on a GPU using CUDA. To do this as efficiently as possible, I have to merge the two outer loops into a single loop whose index I can iterate over with a stride. For that reason I need to merge the two sets of Cartesian indices, `Cpre` and `Cpost`, into one. On the CPU I came up with the following solution:

```
Cprepost = CartesianIndices((spre..., spost...))
for I in Cprepost
    Ipre = Tuple(I)[1:dim-1]
    Ipost = Tuple(I)[dim:end]
    for i = 1:sdim
        A[Ipre..., i, Ipost...] = i
    end
end
```

By analogy, I would expect the GPU kernel to look like this:

```
using CUDA

function kernel(A, dim, Cprepost, sdim)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for I in Cprepost[id:stride:end]
        Ipre = Tuple(I)[1:dim-1]
        Ipost = Tuple(I)[dim:end]
        for i = 1:sdim
            A[Ipre..., i, Ipost...] = i
        end
    end
    return nothing
end

Agpu = CUDA.zeros(s)
@cuda threads=prod(spre)*prod(spost) kernel(Agpu, dim, Cprepost, sdim)
```

However, this code causes an "unsupported call through a literal pointer" error.

The only kernel that has worked for me so far is the following:

```
function kernel(A, dim, Cprepost, sdim)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    dim = 3   # hard-coded here; without this line the error comes back
    for k = id:stride:length(Cprepost)
        Ipre = Tuple(Cprepost[k])[1:dim-1]
        Ipost = Tuple(Cprepost[k])[dim:end]
        for i = 1:sdim
            A[Ipre..., i, Ipost...] = i
        end
    end
    return nothing
end

@cuda threads=prod(spre)*prod(spost) kernel(Agpu, dim, Cprepost, sdim)
```

However, here I have to explicitly define the `dim` variable inside the kernel, because otherwise I get the same "unsupported call through a literal pointer" error.
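One direction that might sidestep the problem is to avoid slicing `Tuple(I)` by a runtime `dim` altogether: keep `Cpre` and `Cpost` separate and split a single flat index `k` into its "pre" and "post" parts with integer arithmetic. The sketch below shows only the index logic on the CPU, so it runs without a GPU. The helper name `fill_along!` is hypothetical, and whether this form compiles cleanly inside a CUDA kernel is an assumption that has not been checked here.

```julia
# CPU sketch: decompose a flat index k into (Ipre, Ipost) without slicing
# a tuple by a runtime `dim`. `fill_along!` is a hypothetical helper name.
function fill_along!(A, Cpre, Cpost, sdim)
    Npre = length(Cpre)
    Npost = length(Cpost)
    for k in 1:Npre * Npost
        Ipre = Cpre[(k - 1) % Npre + 1]     # fast-varying leading indices
        Ipost = Cpost[(k - 1) ÷ Npre + 1]   # slow-varying trailing indices
        for i in 1:sdim
            A[Ipre, i, Ipost] = i
        end
    end
    return A
end

s = (1, 2, 3, 4, 5)   # same shape as in the example above
dim = 3
Cpre = CartesianIndices(s[1:dim-1])
Cpost = CartesianIndices(s[dim+1:end])
A = fill_along!(zeros(s), Cpre, Cpost, s[dim])
```

In a kernel, the outer loop would presumably become `for k = id:stride:Npre*Npost`, with `Cpre` and `Cpost` passed as arguments in place of `Cprepost` and `dim`.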

Could you please help me write the corresponding CUDA kernel?

Thank you.