I am trying to port a `mapslices()` call to a CUDA kernel. The call takes each 2D layer of a 3D array and multiplies it by the same matrix, as shown below:

```
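A = rand(4, 4, 3)   # three 4×4 layers
B = rand(4, 4)      # matrix applied to every layer
# mapslices allocates its own result, so no preallocated C is needed
C = mapslices(x -> x * B, A, dims = [1, 2])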
```
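To make the intended result concrete, here is a minimal CPU-only sketch of the same computation written as an explicit loop over the third dimension. The names `C_ref` and `C_loop` are introduced here just for illustration; the loop body is what I would like each GPU thread (or group of threads) to do.

```julia
using LinearAlgebra  # provides mul!

A = rand(4, 4, 3)
B = rand(4, 4)

# Reference result from the original mapslices call
C_ref = mapslices(x -> x * B, A, dims = [1, 2])

# Same computation as an explicit loop over the layers
C_loop = similar(A)
for k in axes(A, 3)
    # views avoid copying each slice; mul! writes the product in place
    mul!(view(C_loop, :, :, k), view(A, :, :, k), B)
end

@assert C_loop ≈ C_ref
```

Each iteration is independent of the others, which is why I expected this to parallelize easily.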

I then tried to write the kernel like this:

```
using CUDA
A = CUDA.rand(4, 4, 3)
B = CUDA.rand(4, 4)
C = similar(A)
function kernel!(C, A, B)
    i = threadIdx().x
    if (i <= size(A, 3))
        @inbounds C[:, :, i] = A[:, :, i] * B
    end
    return nothing
end
@cuda threads = size(A, 3) kernel!(C, A, B)
```

When I ran this kernel, the Julia REPL quit unexpectedly without showing any output. I also tried the same approach with lower-dimensional arrays, but that produced errors. I suspect that slicing out a column, row, or layer inside a kernel may not work, as opposed to accessing a single element.
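For comparison, here is a rough sketch of what I understand a single-element version would look like: one thread per output element, with the matrix product written out as a scalar loop so that no slices or temporary arrays are created inside the kernel. The name `slicemul_kernel!` and the launch configuration are my own assumptions, and the single-block launch only works because this example is small (48 threads).

```julia
using CUDA

# Hypothetical kernel: each thread computes one element C[i, j, k]
function slicemul_kernel!(C, A, B)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    k = (blockIdx().z - 1) * blockDim().z + threadIdx().z
    if i <= size(C, 1) && j <= size(C, 2) && k <= size(C, 3)
        acc = zero(eltype(C))
        # Row i of layer k times column j of B, using scalar accesses only
        for l in 1:size(A, 2)
            @inbounds acc += A[i, l, k] * B[l, j]
        end
        @inbounds C[i, j, k] = acc
    end
    return nothing
end

A = CUDA.rand(4, 4, 3)
B = CUDA.rand(4, 4)
C = similar(A)

# One block of 4×4×3 threads; larger arrays would also need a blocks= argument
@cuda threads = size(C) slicemul_kernel!(C, A, B)

# Compare against the CPU mapslices result
C_ref = mapslices(x -> x * Array(B), Array(A), dims = [1, 2])
@assert Array(C) ≈ C_ref
```

If this is the right direction, it would suggest that the problem in my kernel is the slice expressions `A[:, :, i]` and `A[:, :, i] * B`, which would each have to allocate a new array on the device.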

Am I right? If so, are there other ways to execute `mapslices()` from my example in parallel? I've considered multithreading, but it seems to work serially, not in parallel.