Broadcasting using slices on the GPU

I’m trying to broadcast a two-argument function over slices of its second argument, like so:

# using CUDA
# todevice = cu
todevice = identity

x = rand(5) |> todevice
a = rand(2, 5) |> todevice

function kernel(x, a)
    a * x  # dummy computation
end

kernel.(x, eachcol(a))

This more or less works on the CPU, but it produces an array of arrays instead of a 2D array. When I try the same thing on the GPU, I get the following error:

ERROR: LoadError: CuArray only supports element types that are stored inline

I assume this is related to the array-of-arrays issue.

Is there a way to make this work on the GPU without array mutation? I also want to compute gradients with Zygote, which does not support mutating arrays.

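Creating arrays of arrays like that is not supported on the GPU: as the error says, a CuArray can only hold element types that are stored inline, and each result here is itself an array. Slicing functions like eachcol/eachrow/mapslices also tend not to perform well on the GPU anyway, since they typically result in a separate kernel being launched for each row/column/slice.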
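
For the dummy computation in the question, one way to avoid slices entirely is to express the whole thing as a single broadcast over the full matrix. The sketch below assumes the real kernel can also be written in terms of whole-array operations, which may not hold for every function. The names `out_slices` and `out_bcast` are only illustrative. It reshapes `x` into a 1×5 row so that column `j` of `a` is scaled by `x[j]`, with no mutation involved:

```julia
# using CUDA
# todevice = cu
todevice = identity   # swap in `cu` to try this on the GPU (assumes CUDA.jl is loaded)

x = rand(5) |> todevice
a = rand(2, 5) |> todevice

kernel(x, a) = a * x  # same dummy computation as in the question

# CPU-only reference: broadcast over column slices, then concatenate
# the resulting vector of vectors into a 2×5 matrix.
out_slices = reduce(hcat, kernel.(x, eachcol(a)))

# Slice-free version: reshape x to a 1×5 row and broadcast against a,
# so that out_bcast[i, j] == a[i, j] * x[j].
out_bcast = a .* reshape(x, 1, :)

println(size(out_bcast))
println(out_bcast ≈ out_slices)
```

The `out_slices` line is only there as a CPU reference for comparison; on the GPU you would keep just the `out_bcast` line, which is one fused broadcast rather than one operation per column.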