What is `Base.allocatedinline` and custom types for `CuArray` (`CUDA.jl`) involving `Missing`

I am trying to pass an array of a custom type to the GPU using CUDA.jl. Here is a MWE:

using CUDA

struct MyStruct
    arr::Vector{Tuple{Int, Int}}
end

cpu_arr = Vector{Union{Missing, MyStruct}} # Works as expected
gpu_arr = CuVector(cpu_arr) # ERROR: CuArray only supports element types that are stored inline

Looking into this error, I find that in CUDA.jl/src/array.jl at line 36 there is a call to Base.allocatedinline(T). When this call returns false, the error is thrown. At first, I thought this could simply be solved by using Adapt.jl. However, when I add Adapt.@adapt_structure MyStruct before calling CuVector(cpu_arr), nothing changes and I get the same error.

So then, I turned my attention to Base.allocatedinline(T). I have no idea what this function does (see source code), because it calls jl_stored_inline, which I could not find any documentation about. Also see the only discussion about this function I could find. These don’t really help in trying to figure out how to solve my issue. So my first question is: what exactly does Base.allocatedinline(T) do?

I’m especially confused by this function because calling Base.allocatedinline(MyStruct) returns true, and so does calling Base.allocatedinline(Missing). However, the call to Base.allocatedinline(Union{Missing, MyStruct}) returns false (hence the error from the MWE). I thought this was maybe a property of Union, but it turns out not to be the case. A call to Base.allocatedinline returns true for both Int and Bool, and so does Base.allocatedinline(Union{Int, Bool}). So at this point, I’m left a little confused about this function and what I can do to my code to make sure that I can call CuVector on an array of my custom struct.
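For anyone wanting to reproduce these observations without a GPU, here is a minimal sketch that prints Base.allocatedinline next to isbitstype for the types mentioned above. The Union{Missing, Int} case is an extra added for comparison. My reading, which is an assumption and not something confirmed in this thread, is that a Union is only stored inline when every member is a bits type, while a lone immutable struct may be stored inline even if it holds a reference.

```julia
# Same definition as in the question, with the closing `end`.
struct MyStruct
    arr::Vector{Tuple{Int, Int}}
end

# Single types: compare inline storage with the stricter isbitstype check.
for T in (Int, Bool, Missing, MyStruct)
    println(rpad(string(T), 10),
            " allocatedinline = ", Base.allocatedinline(T),
            ", isbitstype = ", isbitstype(T))
end

# Unions: Union{Missing, Int} is an extra comparison case (not from the question).
for U in (Union{Int, Bool}, Union{Missing, Int}, Union{Missing, MyStruct})
    println(rpad(string(U), 26),
            " allocatedinline = ", Base.allocatedinline(U))
end
```

Note that Base.allocatedinline is an internal, undocumented function, so its behavior may differ between Julia versions.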

I understand that what I’m trying to do is not particularly niche. As such, getting odd errors that I can’t find much help for online tells me that I am doing something very wrong. So on top of my previous question, I would like to ask: am I doing this right? Does it even make sense to create CuArrays of custom types? Are there more standard ways of solving this kind of problem?

Thank you!

There are several issues with this MWE. First of all, cpu_arr is a type, not a value. Your MyStruct should be parametric because a Vector can never live on the GPU. Then there’s the problem of arrays-of-arrays, which doesn’t really work; see Arrays of arrays and arrays of structures in CUDA kernels cause random errors - #12 by fedoroff.

An example of how this could work:

using CUDA, Adapt

struct MyStruct{T <: AbstractVector{Tuple{Int,Int}}}
    arr::T
end
MyStruct(x::T) where T = MyStruct{T}(x)
Adapt.adapt_structure(to, x::MyStruct) = MyStruct(adapt(to, x.arr))

function main()
    cpu_arr = [MyStruct(Vector{Tuple{Int, Int}}())]

    # first, convert the contained MyStruct{Vector} elements
    gpu_arr = adapt.(CuArray, cpu_arr)

    # we could now convert gpu_arr::Vector to a CuArray, but since we'll also have to
    # convert the MyStruct{CuArray} elements to MyStruct{CuDeviceArray} (because @cuda
    # doesn't automatically convert array _elements_ when calling a kernel), we'll
    # leave it on the CPU

    GC.@preserve gpu_arr begin
        gpu_device_arr = cudaconvert.(gpu_arr)
        final_gpu_arr = CuArray(gpu_device_arr)

        # final_gpu_arr is now a CuArray of MyStruct{CuDeviceArray} elements, which we can
        # pass to a kernel (@cuda will do the outer CuArray->CuDeviceArray conversion)
        @show typeof(final_gpu_arr)
    end
end

main()

But I recommend simplifying the setup instead.

Thank you! This has given me a lot to think about.

In the thread you linked, you say

CuArrays live on the GPU, so the CPU GC doesn’t see them, which means the CuArray elements might get freed early. Furthermore, CuArrays are not GPU compatible, they need to be converted to a CuDeviceArray […]

I don’t quite get how CuArrays live on the GPU but are not GPU compatible. From what I understand from that thread, the GPU can only perform computations using CuDeviceArray. So then CuArray is some kind of device-host hybrid that has a reference on the CPU to make sure the GC doesn’t free anything, but converts its data to a CuDeviceArray when it is needed for a kernel launch?

So then, if I understood that discussion correctly, either I have to deal with the fact that to have arrays of arrays I’ll have to have many conversions from CuArray to CuDeviceArray for each kernel launch (very expensive from what I gather), or I can find a way to simplify everything to a CuArray and not have to worry about any of this.

Correct. CuArrays represent device memory, but for use on the host (i.e. to queue operations, manage the memory, track it using the GC, etc). But the layout of that struct is not a bits type (among other things, because it contains a reference to the CUDA context it was allocated in), so we need a device-side representation that can be used within kernels executing on the GPU.

Regardless of that division, the problem with CuArrays of CuArrays would be that the GC doesn’t scan GPU memory, so it wouldn’t keep the allocations of the CuArray elements alive.
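The host/device split described above can be inspected directly. This is a small sketch, assuming a working CUDA.jl installation and a usable GPU (hence the CUDA.functional() guard); the variable names are only for illustration. It converts a CuArray with cudaconvert and shows the type and isbitstype result for both representations.

```julia
using CUDA

if CUDA.functional()  # skip everything when no usable GPU is available
    host_arr = CUDA.zeros(Float32, 4)   # CuArray: host-side handle to device memory
    dev_arr  = cudaconvert(host_arr)    # device-side representation used inside kernels

    # Show each type together with whether it is a bits type.
    @show typeof(host_arr) isbitstype(typeof(host_arr))
    @show typeof(dev_arr)  isbitstype(typeof(dev_arr))
end
```

The converted value only refers to the memory owned by host_arr, so host_arr has to stay alive for as long as dev_arr is in use.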

Simplifying your application to a single ‘layer’ of CuArrays is probably the easiest solution. Note that CuArray element types can be complex, so you could do a structure-of-arrays/array-of-structures style transformation. But if you want to keep the arrays-of-arrays, know that a CuDeviceArray just points to an existing CuArray, so you could prepare the arguments once and launch the kernel multiple times.
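One possible way to get down to a single layer, sketched here under my own assumptions and not taken from the thread: flatten the nested vectors into one flat vector of tuples plus an offsets vector. The helper name flatten_nested and the offsets layout are hypothetical. The code runs on the CPU only; both results have bits-type elements.

```julia
# Hypothetical helper: flatten nested vectors into one flat vector plus offsets.
# Entry i of the input ends up at flat[offsets[i]:(offsets[i + 1] - 1)].
function flatten_nested(nested::Vector{Vector{Tuple{Int,Int}}})
    offsets = Vector{Int}(undef, length(nested) + 1)
    offsets[1] = 1
    for (i, v) in enumerate(nested)
        offsets[i + 1] = offsets[i] + length(v)
    end
    # `init` keeps the element type stable even when `nested` is empty.
    flat = reduce(vcat, nested; init = Tuple{Int,Int}[])
    return flat, offsets
end

nested = [[(1, 2), (3, 4)], Tuple{Int,Int}[], [(5, 6)]]
flat, offsets = flatten_nested(nested)

# Recover the third inner vector from the flat representation.
third = flat[offsets[3]:(offsets[4] - 1)]
@show flat offsets third
```

Since Tuple{Int,Int} and Int are bits types, CuArray(flat) and CuArray(offsets) should then upload as two ordinary arrays. A missing entry could be encoded as an empty range or with a separate Bool mask, depending on whether "missing" and "empty" need to be distinguished.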