I have a 3-dimensional tensor, where each “slice” of the tensor is a matrix. I want to multiply each slice by another matrix and store the result in a 3D tensor/array.

How do I do that in the most efficient way using the GPU?

For example, I have:

```
using CUDA
CUDA.allowscalar(false)
tensor = rand(4, 4, 1000) |> cu
matrix = rand(4,4) |> cu
result = mapslices(slice -> slice * matrix, tensor, dims=(1, 2))
```

This fails with an error because scalar indexing is not allowed.
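
To make the intended result concrete, here is a minimal CPU-only sketch of the computation I am after, using plain `Array`s instead of GPU arrays. The names `tensor_cpu`, `matrix_cpu`, and `result_cpu` are only for illustration; this is a reference for the expected output, not a GPU solution.

```
# CPU reference for the intended result (plain Arrays, no GPU involved)
tensor_cpu = rand(4, 4, 1000)
matrix_cpu = rand(4, 4)
result_cpu = similar(tensor_cpu)
for k in axes(tensor_cpu, 3)
    # multiply the k-th 4×4 slice by the matrix on the right
    result_cpu[:, :, k] = tensor_cpu[:, :, k] * matrix_cpu
end
```

I would like the same `result`, but computed on the GPU without falling back to scalar indexing.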

Do I need to write a kernel myself?