Slicing a CuArray in a kernel

Hi, I would like to take a slice from a CuArray in a kernel:

using CUDA
using StaticArrays
using LinearAlgebra

function run()
    a::Float64=5.0

    n=3
    vectors = CUDA.rand(Float64, 100,n)
    #vectors=CuArray([@SVector rand(n) for i in 1:100])
    results = CUDA.ones(100,n)#CuArray{SVector{2,Float64},1}(undef, 100)
    transform = Diagonal(@SVector ones(n)) * a
    function linear_transform_kernel(vectors,::Val{N}) where {N}
        i = threadIdx().x
        results[i,:].= -vectors[i,:]
        #results[i,:].= transform*vectors[i,:]
        return
    end
    CUDA.@sync @cuda threads=100 linear_transform_kernel(vectors, Val(n))
    display(results)
end
run()

The slicing operation vectors[i,:] fails even in this simple example:

ERROR: LoadError: InvalidIRError: compiling MethodInstance for (::var"#linear_transform_kernel#12"{CuDeviceMatrix{Float32, 1}})(::CuDeviceMatrix{Float64, 1}, ::Val{3}) resulted in invalid LLVM IR
Reason: unsupported call through a literal pointer (call to ijl_alloc_array_1d)
Stacktrace:
  [1] Array
    @ ./boot.jl:477
  [2] Array
    @ ./boot.jl:486
  [3] Array
    @ ./boot.jl:494
  [4] similar
    @ ./abstractarray.jl:877
  [5] similar
    @ ./abstractarray.jl:876
  [6] similar
    @ ./broadcast.jl:224
  [7] similar
    @ ./broadcast.jl:223
  [8] copy
    @ ./broadcast.jl:928
  [9] materialize
    @ ./broadcast.jl:903
 [10] broadcast_preserving_zero_d
    @ ./broadcast.jl:892
 [11] -
    @ ./abstractarraymath.jl:218
 [12] linear_transform_kernel
    @ /nfs/c3po/home/ge78muc/terra-dg-group-1/cuda_exploration.jl:19
...

Interestingly, results[i,:] .= ... does work when vectors is a CuArray of SVectors. As far as I can tell, the slice vectors[i,:] (and the broadcast that negates it) tries to allocate a temporary Array via similar, which is not possible in device code, whereas an SVector is an isbits value that needs no allocation.
What are some ways to get around this? From my understanding, accessing a row of a matrix inside a CUDA kernel should be a common need. Do you use a CuArray of StaticArrays then? But those are immutable, which is undesirable for my use case…
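For reference, this is roughly what I mean by the SVector variant (a minimal sketch along the lines of the commented-out code above; the name run_svectors is mine and n is fixed to the literal 3):

using CUDA
using StaticArrays
using LinearAlgebra

function run_svectors()
    a = 5.0
    # one SVector per entry instead of one matrix row per entry
    vectors = CuArray([@SVector rand(3) for i in 1:100])
    results = CuArray{SVector{3,Float64}}(undef, 100)
    transform = Diagonal(@SVector ones(3)) * a
    function linear_transform_kernel(vectors)
        i = threadIdx().x
        # vectors[i] is an isbits SVector, so no device-side allocation is needed
        results[i] = transform * vectors[i]
        return
    end
    CUDA.@sync @cuda threads=100 linear_transform_kernel(vectors)
    display(results)
end
run_svectors()

The captured results and transform are converted for the device just like in the example above, but every access is a plain element access, so nothing has to allocate.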

I found a fix:

using CUDA
using StaticArrays
using LinearAlgebra

function run()
    a::Float64=5.0

    n=3
    #vectors=CuArray([@SVector rand(n) for i in 1:100])
    vectors = CUDA.rand(Float64, 100,n)
    results = CUDA.ones(100,n)
    #results=CuArray{SVector{n,Float64},1}(undef, 100)
    transform = Diagonal(@SVector ones(n))*a
    function linear_transform_kernel(vectors,::Val{N}) where {N}
        i = threadIdx().x
        v = SVector{N,Float64}(@view vectors[i,:])
        @views results[i,:] .= transform * v
        return
    end
    CUDA.@sync @cuda threads=100 linear_transform_kernel(vectors, Val(n))
    display(results)
end
run()

I think the @view avoids allocating a temporary array when building the SVector from the slice, and the @views on the left-hand side makes the assignment write directly into the row of results. Still, I would be interested in other ways of solving the problem.
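One more alternative I can think of (again only a sketch: since the transform here is Diagonal(ones(n)) * a, it reduces to scaling every element by a, and the kernel name linear_transform_kernel_loop is mine) is to drop StaticArrays and loop over the columns explicitly; plain scalar indexing into device arrays inside a kernel is allowed and allocates nothing:

function linear_transform_kernel_loop(results, vectors, a, ::Val{N}) where {N}
    i = threadIdx().x
    for j in 1:N
        # element-wise scale, no temporaries are created on the device
        @inbounds results[i, j] = a * vectors[i, j]
    end
    return
end

# with vectors and results defined as above:
# CUDA.@sync @cuda threads=100 linear_transform_kernel_loop(results, vectors, 5.0, Val(3))

For a general (non-diagonal) transform you would add an inner loop over a second index and accumulate into results[i, j].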