Solve linear systems inside CUDA kernel function

That’s very unlikely to work. You cannot dynamically allocate memory inside a GPU kernel (see also this recent post: Modifying a thread-local vector within CUDA Dynamic Parallelism - #2 by vchuravy).

What should work, though, is to allocate all CuArrays outside the kernel, then, inside the kernel, convert the relevant views of your arrays into SMatrix/SVector values and do the solve on StaticArrays only. (I don’t have access to a GPU at the moment to check.)
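A rough sketch of that idea, not checked on a GPU either, could look like the following. It assumes fixed 3×3 systems stored in a 3×3×n array, one system per thread. The names `solve_kernel!`, `A`, `B` and `X` are made up for illustration. Instead of converting a view, it reads the entries one by one into the static arrays, which is a deliberately conservative variant of the suggestion above.

```julia
using CUDA, StaticArrays

# Hypothetical kernel: thread i solves A[:, :, i] * x = B[:, i] and writes x to X[:, i].
function solve_kernel!(X, A, B)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= size(B, 2)
        # Build stack-allocated static arrays from individual entries
        # (arguments are given in column-major order); no heap allocation involved.
        Ai = SMatrix{3,3,Float32}(
            A[1, 1, i], A[2, 1, i], A[3, 1, i],
            A[1, 2, i], A[2, 2, i], A[3, 2, i],
            A[1, 3, i], A[2, 3, i], A[3, 3, i])
        bi = SVector{3,Float32}(B[1, i], B[2, i], B[3, i])

        # The solve itself happens entirely on StaticArrays types.
        xi = Ai \ bi

        X[1, i] = xi[1]
        X[2, i] = xi[2]
        X[3, i] = xi[3]
    end
    return nothing
end

# All device arrays are allocated outside the kernel.
n = 1024
A = CUDA.rand(Float32, 3, 3, n)
B = CUDA.rand(Float32, 3, n)
X = CUDA.zeros(Float32, 3, n)

@cuda threads=256 blocks=cld(n, 256) solve_kernel!(X, A, B)
```

The system size has to be known at compile time for this to work, since it is part of the `SMatrix` type. For sizes that are only known at runtime, this approach would not apply as written.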