How to create sliced views of `CuArray` correctly?

Hi, I would like to split a CuArray and reshape the pieces into a StructArray “in place”, i.e. as views that share memory with the original array. I used sliced views of the input CuArray and got a scalar indexing warning when the result was shown in the REPL. Is there a correct way of doing the following?

julia> siz=(2,4); L=prod(siz);

julia> using CUDA, StructArrays

julia> function vec2sa(cuv,siz)
       L=prod(siz)
       cuv1=reshape(view(cuv,1:L),siz)
       cuv2=reshape(view(cuv,L+1:2L),siz)
       cusa=StructArray((cuv1,cuv2))
       return cusa
       end
vec2sa (generic function with 1 method)

julia> cuv=CuArray(rand(2L))
16-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
 0.3795848396969945
 0.2971222868527267
 0.7780369899925469
 0.2550287880476231
 0.011486317937175805
 0.916967110456904
 0.949779698848461
 0.9265215896999244
 0.727648535419387
 0.2176276431142744
 0.3315939522658926
 0.5243095426809381
 0.5445176095969979
 0.21718867942907105
 0.10315829876358462
 0.20188172113924052

julia> cusa=vec2sa(cuv,siz)
2×4 StructArray(::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, ::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}) with eltype Tuple{Float64, Float64}:
┌ Warning: Performing scalar indexing on task Task (runnable) @0x00007f672e5888b0.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArraysCore ~/.julia/packages/GPUArraysCore/B3xv7/src/GPUArraysCore.jl:103
 (0.379585, 0.727649)  …  (0.94978, 0.103158)
 (0.297122, 0.217628)     (0.926522, 0.201882)
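
In the transcript above, the summary line of the StructArray is printed before the warning, which suggests that the scalar indexing happens while the REPL displays the elements, not inside `view` or `reshape`. A minimal sketch, assuming that reading is right: suppress the automatic display with a trailing semicolon and inspect the data on the host, or permit scalar indexing only for the display call. Variable names follow the post.

```julia
using CUDA, StructArrays

siz = (2, 4); L = prod(siz)
cuv = CuArray(rand(2L))

# Contiguous views of the parent vector, reshaped to the target size.
# Per the type printed in the transcript, each component is itself a CuArray.
cuv1 = reshape(view(cuv, 1:L), siz)
cuv2 = reshape(view(cuv, L+1:2L), siz)

# Trailing semicolon: the REPL does not try to print the elements.
cusa = StructArray((cuv1, cuv2));

# Option 1: copy a component to the host and look at it there.
host1 = Array(cusa.:1)

# Option 2: allow scalar indexing for this one display call only
# (assumption: the warning comes from the element-wise printing).
CUDA.@allowscalar display(cusa)
```

With either option there should be no need to call `CUDA.allowscalar(true)` globally.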

The slicing itself seems to add very little overhead compared with other typical operations on the CuArray, such as in-place FFTs (for this test I put CUDA.allowscalar(true) inside the function):

julia> function vec2sa(cuv,siz)
       L=prod(siz); CUDA.allowscalar(true)
       cuv1=reshape(view(cuv,1:L),siz)
       cuv2=reshape(view(cuv,L+1:2L),siz)
       cusa=StructArray((cuv1,cuv2))
       return cusa
       end
vec2sa (generic function with 1 method)

julia> siz=(256,256,128); L=prod(siz);

julia> cuv=CuArray(rand(ComplexF64,2L));

julia> using BenchmarkTools

julia> @btime CUDA.@sync cusa=vec2sa($cuv,$siz);
  380.704 ns (5 allocations: 240 bytes)

julia> cusa=vec2sa(cuv,siz);

julia> using FFTW

julia> pc= plan_fft!(cusa.:1);

julia> pci= plan_ifft!(cusa.:1);

julia> @btime CUDA.@sync begin
       $pc*$cusa.:1
       $pc*$cusa.:2
       $pci*$cusa.:1
       $pci*$cusa.:2
       end;
  30.668 ms (51 allocations: 3.03 KiB)
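
The few hundred nanoseconds measured for `vec2sa` are consistent with no data being copied. One way to check that the components really alias the input is to write through a component and read the parent back on the host. This is only a sketch: `vec2sa_views` is a hypothetical name for the same function without the global `CUDA.allowscalar(true)` call, and the small sizes are chosen for illustration.

```julia
using CUDA, StructArrays

# Same construction as vec2sa, without changing the global scalar-indexing setting.
function vec2sa_views(cuv, siz)   # hypothetical name
    L = prod(siz)
    a = reshape(view(cuv, 1:L), siz)
    b = reshape(view(cuv, L+1:2L), siz)
    return StructArray((a, b))
end

siz = (2, 4); L = prod(siz)
cuv = CUDA.rand(Float64, 2L)
before = Array(cuv)               # host copy of the original contents

cusa = vec2sa_views(cuv, siz);

# Overwrite the first component on the GPU, then copy the parent to the host.
fill!(cusa.:1, 0.0)
after = Array(cuv)

# If the component aliases cuv, the first half is now zero
# and the second half is unchanged.
first_half_zeroed = all(iszero, after[1:L])
second_half_same  = after[L+1:2L] == before[L+1:2L]
```

If both flags come out `true`, the StructArray components are views into `cuv` and not copies.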