Arrays of arrays and arrays of structures in CUDA kernels cause random errors

Hello,

I want to pass an array of structures to CUDA kernel. Each structure contains an array of CuArrays.
Most of the time my code works well, but sometimes randomly I see errors of the following type:

ERROR: Out-of-bounds array access.
WARNING: Error while freeing DeviceBuffer(96 bytes at ....)
LoadError: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)

The MWE where I managed to observe this behaviour is the following:

using Adapt
using CUDA


struct Integrator{P, K}
    p :: P   # Tuple
    ks :: K   # Array of CuArrays
end


function Adapt.adapt_structure(to, integ::Integrator)
    ks = [Adapt.adapt_structure(to, integ.ks[i]) for i=1:length(integ.ks)]
    ks = CuArray(ks)
    ks = Adapt.adapt_structure(to, ks)
    return Integrator(integ.p, ks)
end


function kernel(u, integs)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x

    Nu, Np = size(u)

    for ip=id:stride:Np
        ks = integs[ip].ks

        for iu=1:Nu
        for i=1:4
            u[iu, ip] += ks[i][iu] + 1
        end
        end
    end
    return nothing
end


# ******************************************************************************
Nu = 100
ks = [CUDA.zeros(Nu) for i=1:4]

Np = 10000
a = range(1f0, 3f0, length=Np)

integs = Array{Integrator}(undef, Np)
for i=1:Np
    p = (a[i], )
    integs[i] = Integrator(p, ks)
end
integs = CuArray([cudaconvert(integs[i]) for i=1:Np])


# ******************************************************************************
u = CUDA.zeros((Nu, Np))

nthreads = min(Np, 256)
nblocks = cld(Np, nthreads)
@cuda threads=nthreads blocks=nblocks kernel(u, integs)

@show u == 4 * CUDA.ones((Nu, Np))

If I run this code for consecutively for 10-20 times, non stopping error messages with the above error messages will start to appear.

My main questions are:

  • Is it a correct way to write the adaptor for such type of structures?
  • Can I use the constructions like integs = CuArray([cudaconvert(integs[i]) for i=1:Np]) to pass the array of structures to the kernel?
  • What is going on? Why do I see these errors randomly?

Thank you.

You cannot call cudaconvert yourself without preserving taking the lifetime of the original object. A CuDeviceArray is the device-counterpart, and doesn’t keep the parent CuArray alive. CUDA.jl takes care to preserve the parent when invoking a kernel, but if you call cudaconvert yourself you need to mark the region where the source array needs to be preserved using e.g. GC.@preserve, or by manually rooting it somewhere.

or by manually rooting it somewhere

Sorry, I do not understand. Do you mean by making a copy?

As far as I understood, the problem is here:

integs = CuArray([cudaconvert(integs[i]) for i=1:Np])

Can I solve the issue by writing an adaptor for the array of structures, similarly to what I did with the array of CuArrays in the code above? Something like this:

function Adapt.adapt_structure(to, integs::Vector{Integrator})
    integs = [Adapt.adapt_structure(to, integs[i]) for i=1:length(integs)]
    integs = CuArray(integs)
    return Adapt.adapt_structure(to, integs)
end

No, because we currently don’t convert the element type of arrays you pass to kernels. That could be very expensive (because every launch could then allocate), but as a consequence it requires you to perform the conversion yourself if you want to use arrays of arrays.

You want to keep the source array alive, not by making a copy, but by making sure the object isn’t collected by the GC. One solution is to do GC.@preserve integs begin ... end. See Essentials · The Julia Language you need to treat cudaconvert as unsafe_convert.

Thank you. I will try to play with it.

Does it also mean that the original adapter is also wrong? I mean, this one:

function Adapt.adapt_structure(to, integ::Integrator)
    ks = [Adapt.adapt_structure(to, integ.ks[i]) for i=1:length(integ.ks)]
    ks = CuArray(ks)
    ks = Adapt.adapt_structure(to, ks)
    return Integrator(integ.p, ks)
end

Then how I can even pass something to the kernel? The structure and any arrays of it will be non isbits.

Do you know any examples (maybe some packages) where someone works with arrays of arrays or arrays of structures inside CUDA kernels so I can learn from them?

The adaptor isn’t wrong, you just need to take care when calling it. CUDA.jl does that: CUDA.jl/execution.jl at 0fafe91e4971e94faf289ccc86aa0f68733a18d1 · JuliaGPU/CUDA.jl · GitHub

You can do the conversion yourself, you just need to make sure the source object isn’t collected by the GC.

Thank you very much!

Thank you, I have read the docs for GC.@preserve and this interesting discussion: On the garbage collection. I hope now I understand what causes the error in my case. However, it is not clear for me how I should use GC.@preserve in case of array of arrays. Here I prepared a much more simple MWE:

using CUDA

function kernel(a, bb)
    id = threadIdx().x
    a[id] = sum(bb[id])
    return nothing
end


N = 10

a = CUDA.zeros(N)

b = Array{CuArray}(undef, N)
for i=1:N
    b[i] = CUDA.ones(2)
end


# This potentially can cause an error
bb = CuArray([cudaconvert(b[i]) for i=1:N])
@cuda threads=N kernel(a, bb)

# Option 1:
bb = CuArray([cudaconvert(b[i]) for i=1:N])
GC.@preserve b begin
    @cuda threads=N kernel(a, bb)
end

# Option 2:
btmp = [cudaconvert(b[i]) for i=1:N]
bb = CuArray(btmp)
GC.@preserve btmp begin
    @cuda threads=N kernel(a, bb)
end

In this example I do not understand what exactly I should preserve: the original array of CuArrays b, the temporary array of CuDeviceArrays [cudaconvert(b[i]) for i=1:N], or both of them.

Sorry, it is very hard to debug such code, since GC manages to collect the original objects only if the kernel has been running long enough. Can I mimic the GC behaviour and cause the error by myself using e.g. something like finilize(b)?

The allocation is tied to CuArray, so you should preserve b across all uses of bb.

I do not know, maybe I am doing something wrong, but the solution with GC.@preserve does not work. On the latest simplified MWE I can not catch the issue at all. On the original MWE from the first post I replace the line

@cuda threads=nthreads blocks=nblocks kernel(u, integs)

by

GC.@preserve integs begin
    @cuda threads=nthreads blocks=nblocks kernel(u, integs)
end

and it does not help. I still see the errors

WARNING: Error while freeing DeviceBuffer(38.147 MiB at 0x0000000305629000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002cc), meta=nothing)

ERROR: LoadError: CUDA error: misaligned address (code 716, ERROR_MISALIGNED_ADDRESS)

Please note that each structure in the CuArray of structures has an array of CuArrays as its field.
Should I also preserve it somehow?

That won’t work since you overwrite integs. You need to understand what the culprit is here: your original CuArrays are collected because no reference to them is live, as you overwrite integs (which used to contain instances) with pointers to those arrays (which don’t keep the array alive).

I also tried the following:

cuintegs = CuArray([cudaconvert(integs[i]) for i=1:Np])

GC.@preserve integs begin
    @cuda threads=nthreads blocks=nblocks kernel(u, cuintegs)
end

Unfortunately, it does not help.

As far as I understand, when the function cudaconvert(integs[i]) is called it also calls the corresponding adaptor:

function Adapt.adapt_structure(to, integ::Integrator)
    ks = [Adapt.adapt(to, integ.ks[i]) for i=1:length(integ.ks)]
    ks = CuArray(ks)
    ks = Adapt.adapt_structure(to, ks)
    return Integrator(integ.p, ks)
end

Inside this adaptor I do the same trick with the ks array of CuArrays which is the field of each integs[i] structure. According to your explanation, here I have the same issue when I transform integ.ks to CuArray ks. However, here it is not clear how to preserve integ.ks.

Yes, that’s likely the culprit. It’s not really OK to create a CuArray in an adaptor. Adapt’s structural rules (i.e. methods of the adapt_structure function) are intended to be separate from the storage-specific operations (e.g. creating a CuDeviceArray from a CuArray, or creating a CuArray from an Array). You should make it so that your Integrator already contains CuArrays before calling the kernel, and you can use Adapt to do so (e.g. have that adapt_structure method call adapt(to, ks) instead of the CuArray constructor, and then calling adapt(CuArray, integ) in your application). Launching the kernel will then take care of the final conversion to CuDeviceArray.

Sorry, but I do not understand how it should help. My ks originally is the Array of CuArrays:

ks = [CUDA.zeros(Nu) for i=1:4]

The only way which I know how to covert it to CuArray is the following:

ks = CuArray([cudaconvert(ks[i]) for i=1:length(ks)])

Right now I do this conversion in the adaptor. Ok, let’s move it outside the adaptor and use the obtained CuArray of CuDeviceArrays to create integ. Then indeed I will be able to write a simple adaptor:

function Adapt.adapt_structure(to, integ::Integrator)
    ks = Adapt.adapt(to, integ.ks)
    return Integrator(integ.p, ks)
end

But how it will protect the original ks array from being garbage collected? With this approach I just relocate the point of failure.

using CUDA, Adapt

struct Integrator{P, K}
    p :: P   # Tuple
    ks :: K   # Array of CuArrays
end
Integrator(p::P, ks::K) where {P, K} = Integrator{P,K}(p, ks)

Adapt.adapt_structure(to, integ::Integrator) =
    Integrator(adapt(to, integ.p), adapt(to, integ.ks))

function kernel(u, integs)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x

    Nu, Np = size(u)

    for ip=id:stride:Np
        ks = integs[ip].ks

        for iu=1:Nu
        for i=1:4
            u[iu, ip] += ks[i][iu] + 1
        end
        end
    end
    return nothing
end

function main()
    Nu = 100
    ks = [CUDA.zeros(Nu) for i=1:4]

    Np = 10000
    a = range(1f0, 3f0, length=Np)

    # manual conversion of the CuArrays elements of Integrator.ks
    GC.@preserve ks begin
        ks = cudaconvert.(ks)
        integs = Array{Integrator}(undef, Np)
        for i=1:Np
            p = (a[i], )
            integs[i] = Integrator(p, CuArray(ks))
        end

        # manual conversion of the Integrator elements
        GC.@preserve integs begin
            integs = CuArray(cudaconvert.(integs))

            u = CUDA.zeros((Nu, Np))

            nthreads = min(Np, 256)
            nblocks = cld(Np, nthreads)
            @cuda threads=nthreads blocks=nblocks kernel(u, integs)

            @show u == 4 * CUDA.ones((Nu, Np))
        end
    end
end

There’s no clean way to do automatic conversion of contained elements, so you should avoid writing your application like that.

Thank you very much for the detailed answer.

This is very sad, since it does not allow implementing more complex logic in CUDA kernels. Do you think it will be solved in future or it is an unsolvable issue? Does this problem exist in CUDA C as well?

By the way, will it help to pack ks into StaticArray in order to avoid writing GC.@preserve ks?

Yes, StaticArrays live on the stack and do not need any tracking.

The problem is that there just isn’t a clear solution; it’s not a technical restriction. Let’s simplify the problem to an array of arrays. Currently, @cuda doesn’t automatically convert CPU to GPU structures, which seems like a sane behavior (or otherwise launching kernels might become unexpectedly expensive, see below). So the user would need to instantiate a CuArray of CuArrays. But CuArrays live on the GPU, so the CPU GC doesn’t see them, which means the CuArray elements might get freed early. Furthermore, CuArrays are not GPU compatible, they need to be converted to a CuDeviceArray, which means that launching a kernel would need to copy the outer CuArray back to the CPU, convert the CuArray elements to CuDeviceArray, allocate a new CuArray{CuDeviceArray} and convert that to a CuDeviceArray{CuDeviceArray}. Again, this would make kernel launching prohibitively expensive.

So the only option left is to make it possible to invoke kernels with Array{CuArray} (which can track the elements’ lifetime, as well as do the conversion directly without copying from the GPU). But then we’re at the start again, where launching a kernel may require any number of allocations (for the outer CuArray) and memory copies (the outer Array → CuArray), more than one in the case where CuArrays occur deeper in the hierarchy as in your example here (where it would need to allocate at least one CuArray per integrator in your array; that would absolutely kill performance). All this is why I haven’t bothered much looking into this. Users probably know best which data structures could be pre-allocated on the GPU anyhow.

Thank you. I hope I understood.

What is the most annoying with the current solution is that I have to explicitly define the area (with begin ... end) where GC.@preserve will preserve the variables. As a result, I can not create a variable somewhere away of the place where I plan to use it.

About the StaticArrays. What I mean is not StaticArray of StaticArrays. I think of converting the original Array of CuArrays to StaticArray of CuArrays or StaticArray of CuDeviceArrays. Will it work?

You can. GC.@preserve is one way of making a sure a variable is not collected. If you can keep a reference in another way (crude example: push! it to a global Any[]), the GC won’t collect it either.

Thank you. And sorry for so many questions.