Problems with LinearAlgebra functions within KernelAbstractions and CUDA

I’m almost done porting my solver to the GPU with the help of CUDA and KernelAbstractions. However, I keep running into problems with LinearAlgebra functions. Take this simple example:

```julia
using StaticArrays
using CUDA: allowscalar, CuArray
using KernelAbstractions
using LinearAlgebra: norm2, ×
using BenchmarkTools

N = (100, 100, 100)
f = Array # swap to CuArray for GPU
p = zeros(Float32, N) |> f;
u = rand(Float32, N..., 3) |> f;
@kernel function kern(p, u)
    I = @index(Global, Cartesian)
    p[I] = norm2(SVector{3}(I.I) × SVector{3}(u[I, :]))
end
apply!(p, u) = kern(get_backend(p), 64)(p, u, ndrange=size(p))
@btime apply!(p, u); # 16.459 ms (1000199 allocations: 76.31 MiB)
```

This gives a crazy number of allocations running on the CPU and won’t run on the GPU:

```
Reason: unsupported call through a literal pointer (call to ijl_alloc_array_1d)
Reason: unsupported dynamic function invocation (call to print_to_string(xs...) in Base at strings/io.jl:133)
Reason: unsupported dynamic function invocation (call to dimension_mismatch_fail(SA::Type, a::AbstractArray) in StaticArrays at C:\Users\gweymouth\.julia\packages\StaticArrays\4uslg\src\convert.jl:190)
```

If I take out the `SVector` conversion, the allocations go way up on the CPU, and the GPU throws errors like `Reason: unsupported dynamic function invocation (call to mapreduce_empty_iter(f, op, itr, ItrEltype) in Base at reduce.jl:375)`.

It seems the main issue is with `SVector{3}(u[I,:])`. Replacing it with `SA[u[I,1],u[I,2],u[I,3]]` runs on both backends and cuts most of the allocations. Can someone explain why?
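As a data point, the allocation difference shows up even outside the kernel. This is a minimal CPU-only check (no GPU or packages involved; the helper names are just for illustration): slicing with `u[I,:]` copies the row into a fresh `Vector` on every call, while reading the three components individually stays allocation-free:

```julia
# Read the row element-by-element: returns a stack-allocated tuple.
read_row(u, I) = (u[I, 1], u[I, 2], u[I, 3])

# Slice the row: `getindex` with a colon copies into a new Vector{Float32}.
copy_row(u, I) = u[I, :]

function row_allocs(u, I)
    read_row(u, I); copy_row(u, I)  # warm up so compilation isn't measured
    (@allocated read_row(u, I)), (@allocated copy_row(u, I))
end

u = rand(Float32, 100, 100, 100, 3)
I = CartesianIndex(1, 2, 3)
tuple_bytes, slice_bytes = row_allocs(u, I)  # tuple_bytes == 0, slice_bytes > 0
```

Per kernel launch that per-element `Vector` would be made `prod(N)` times, which matches the ~1e6 allocations `@btime` reports.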

Does a view on `u[I,:]` help?
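For context on the view idea (a Base-only sketch, no GPU; whether it actually helps inside the kernel is exactly what I'm unsure about): `@view u[I,:]` creates a non-copying window into `u`, unlike the slice, so something like `SVector{3}(@view u[I,:])` might avoid the per-element copy:

```julia
u = rand(Float32, 4, 4, 4, 3)
I = CartesianIndex(2, 3, 4)

s = u[I, :]        # slice: an independent copy of the row
v = @view u[I, :]  # view: aliases the same memory as u

u[I, 1] = -1.0f0
# the copy is unchanged, but the view sees the write
(s[1] == -1.0f0, v[1] == -1.0f0)  # (false, true)
```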