What is `Base.allocatedinline` and custom types for `CuArray` (`CUDA.jl`) involving `Missing`

Correct. A CuArray represents device memory, but is meant for use on the host (i.e. to queue operations, manage the memory, track it using the GC, etc). That struct is not a bits type, though (among other things, because it contains a reference to the CUDA context it was allocated in), so we need a separate device-side representation, CuDeviceArray, that can be used within kernels executing on the GPU.
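
To make the host/device split concrete, here is a minimal sketch that converts a CuArray by hand, assuming a working CUDA.jl installation and a GPU. `cudaconvert` is the conversion that `@cuda` applies to kernel arguments; the variable names are just for illustration.

```julia
using CUDA

a = CUDA.zeros(Float32, 4)   # host-side handle, tracked by the GC

# CuArray is a mutable struct, so it is expected not to be a bits type.
@show isbitstype(typeof(a))

# Convert to the device-side representation, as @cuda does for kernel arguments.
d = cudaconvert(a)
@show typeof(d)              # a CuDeviceArray (pointer plus size information)
@show isbitstype(typeof(d))  # expected to be a bits type, so it can be passed to a kernel
```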

Regardless of that division, the problem with CuArrays of CuArrays would be that the GC doesn’t scan GPU memory, so it wouldn’t keep the allocations of the CuArray elements alive.

Simplifying your application to a single ‘layer’ of CuArrays is probably the easiest solution. Note that CuArray element types can be complex (e.g. structs rather than plain numbers), so you could do a structure-of-arrays/array-of-structures style transformation. But if you want to keep the arrays-of-arrays, know that a CuDeviceArray just points to the memory of an existing CuArray, so you could prepare the arguments once and launch the kernel multiple times.
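
If you do keep the arrays-of-arrays, the "prepare once, launch many times" idea could look roughly like the sketch below. It is only an illustration, assuming a working CUDA.jl installation and a GPU: the kernel `sum_each!` and the names `xs`, `dev_xs` and `out` are made up for this example. The important part is that the host-side vector `xs` stays reachable for as long as the kernels run, because the GC does not see the pointers stored in GPU memory.

```julia
using CUDA

# Host-side vector of CuArrays; this is what keeps the inner allocations alive.
xs = [CUDA.rand(Float32, 8) for _ in 1:3]

# Prepare the kernel argument once: a CuArray whose elements are CuDeviceArrays
# (bits types that just point at the memory owned by the arrays in `xs`).
dev_xs = CuArray(cudaconvert.(xs))
out = CUDA.zeros(Float32, length(xs))

# Hypothetical kernel: thread i sums the i-th inner array.
function sum_each!(out, arrs)
    i = threadIdx().x
    if i <= length(arrs)
        a = arrs[i]
        s = 0f0
        for j in 1:length(a)
            s += a[j]
        end
        out[i] = s
    end
    return nothing
end

# Launch several times with the same prepared arguments.
GC.@preserve xs begin
    for _ in 1:10
        @cuda threads=length(xs) sum_each!(out, dev_xs)
    end
    synchronize()
end

host_out = Array(out)  # copy the per-array sums back to the CPU
```

If any array in `xs` is freed or reallocated, `dev_xs` would hold dangling pointers and would have to be rebuilt.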