Arrays of arrays and arrays of structures in CUDA kernels cause random errors

Hello,

I want to pass an array of structures to CUDA kernel. Each structure contains an array of CuArrays.
Most of the time my code works well, but sometimes randomly I see errors of the following type:

ERROR: Out-of-bounds array access.
WARNING: Error while freeing DeviceBuffer(96 bytes at ....)
LoadError: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)

The MWE where I managed to observe this behaviour is the following:

using Adapt
using CUDA


struct Integrator{P, K}
    p :: P   # Tuple
    ks :: K   # Array of CuArrays
end


function Adapt.adapt_structure(to, integ::Integrator)
    ks = [Adapt.adapt_structure(to, integ.ks[i]) for i=1:length(integ.ks)]
    ks = CuArray(ks)
    ks = Adapt.adapt_structure(to, ks)
    return Integrator(integ.p, ks)
end


function kernel(u, integs)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x

    Nu, Np = size(u)

    for ip=id:stride:Np
        ks = integs[ip].ks

        for iu=1:Nu
        for i=1:4
            u[iu, ip] += ks[i][iu] + 1
        end
        end
    end
    return nothing
end


# ******************************************************************************
Nu = 100
ks = [CUDA.zeros(Nu) for i=1:4]

Np = 10000
a = range(1f0, 3f0, length=Np)

integs = Array{Integrator}(undef, Np)
for i=1:Np
    p = (a[i], )
    integs[i] = Integrator(p, ks)
end
integs = CuArray([cudaconvert(integs[i]) for i=1:Np])


# ******************************************************************************
u = CUDA.zeros((Nu, Np))

nthreads = min(Np, 256)
nblocks = cld(Np, nthreads)
@cuda threads=nthreads blocks=nblocks kernel(u, integs)

@show u == 4 * CUDA.ones((Nu, Np))

If I run this code for consecutively for 10-20 times, non stopping error messages with the above error messages will start to appear.

My main questions are:

  • Is it a correct way to write the adaptor for such type of structures?
  • Can I use the constructions like integs = CuArray([cudaconvert(integs[i]) for i=1:Np]) to pass the array of structures to the kernel?
  • What is going on? Why do I see these errors randomly?

Thank you.

You cannot call cudaconvert yourself without preserving taking the lifetime of the original object. A CuDeviceArray is the device-counterpart, and doesn’t keep the parent CuArray alive. CUDA.jl takes care to preserve the parent when invoking a kernel, but if you call cudaconvert yourself you need to mark the region where the source array needs to be preserved using e.g. GC.@preserve, or by manually rooting it somewhere.

or by manually rooting it somewhere

Sorry, I do not understand. Do you mean by making a copy?

As far as I understood, the problem is here:

integs = CuArray([cudaconvert(integs[i]) for i=1:Np])

Can I solve the issue by writing an adaptor for the array of structures, similarly to what I did with the array of CuArrays in the code above? Something like this:

function Adapt.adapt_structure(to, integs::Vector{Integrator})
    integs = [Adapt.adapt_structure(to, integs[i]) for i=1:length(integs)]
    integs = CuArray(integs)
    return Adapt.adapt_structure(to, integs)
end

No, because we currently don’t convert the element type of arrays you pass to kernels. That could be very expensive (because every launch could then allocate), but as a consequence it requires you to perform the conversion yourself if you want to use arrays of arrays.

You want to keep the source array alive, not by making a copy, but by making sure the object isn’t collected by the GC. One solution is to do GC.@preserve integs begin ... end. See Essentials · The Julia Language you need to treat cudaconvert as unsafe_convert.

Thank you. I will try to play with it.

Does it also mean that the original adapter is also wrong? I mean, this one:

function Adapt.adapt_structure(to, integ::Integrator)
    ks = [Adapt.adapt_structure(to, integ.ks[i]) for i=1:length(integ.ks)]
    ks = CuArray(ks)
    ks = Adapt.adapt_structure(to, ks)
    return Integrator(integ.p, ks)
end

Then how I can even pass something to the kernel? The structure and any arrays of it will be non isbits.

Do you know any examples (maybe some packages) where someone works with arrays of arrays or arrays of structures inside CUDA kernels so I can learn from them?

The adaptor isn’t wrong, you just need to take care when calling it. CUDA.jl does that: CUDA.jl/execution.jl at 0fafe91e4971e94faf289ccc86aa0f68733a18d1 · JuliaGPU/CUDA.jl · GitHub

You can do the conversion yourself, you just need to make sure the source object isn’t collected by the GC.

Thank you very much!

Thank you, I have read the docs for GC.@preserve and this interesting discussion: On the garbage collection. I hope now I understand what causes the error in my case. However, it is not clear for me how I should use GC.@preserve in case of array of arrays. Here I prepared a much more simple MWE:

using CUDA

function kernel(a, bb)
    id = threadIdx().x
    a[id] = sum(bb[id])
    return nothing
end


N = 10

a = CUDA.zeros(N)

b = Array{CuArray}(undef, N)
for i=1:N
    b[i] = CUDA.ones(2)
end


# This potentially can cause an error
bb = CuArray([cudaconvert(b[i]) for i=1:N])
@cuda threads=N kernel(a, bb)

# Option 1:
bb = CuArray([cudaconvert(b[i]) for i=1:N])
GC.@preserve b begin
    @cuda threads=N kernel(a, bb)
end

# Option 2:
btmp = [cudaconvert(b[i]) for i=1:N]
bb = CuArray(btmp)
GC.@preserve btmp begin
    @cuda threads=N kernel(a, bb)
end

In this example I do not understand what exactly I should preserve: the original array of CuArrays b, the temporary array of CuDeviceArrays [cudaconvert(b[i]) for i=1:N], or both of them.

Sorry, it is very hard to debug such code, since GC manages to collect the original objects only if the kernel has been running long enough. Can I mimic the GC behaviour and cause the error by myself using e.g. something like finilize(b)?

The allocation is tied to CuArray, so you should preserve b across all uses of bb.

I do not know, maybe I am doing something wrong, but the solution with GC.@preserve does not work. On the latest simplified MWE I can not catch the issue at all. On the original MWE from the first post I replace the line

@cuda threads=nthreads blocks=nblocks kernel(u, integs)

by

GC.@preserve integs begin
    @cuda threads=nthreads blocks=nblocks kernel(u, integs)
end

and it does not help. I still see the errors

WARNING: Error while freeing DeviceBuffer(38.147 MiB at 0x0000000305629000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002cc), meta=nothing)

ERROR: LoadError: CUDA error: misaligned address (code 716, ERROR_MISALIGNED_ADDRESS)

Please note that each structure in the CuArray of structures has an array of CuArrays as its field.
Should I also preserve it somehow?

That won’t work since you overwrite integs. You need to understand what the culprit is here: your original CuArrays are collected because no reference to them is live, as you overwrite integs (which used to contain instances) with pointers to those arrays (which don’t keep the array alive).

I also tried the following:

cuintegs = CuArray([cudaconvert(integs[i]) for i=1:Np])

GC.@preserve integs begin
    @cuda threads=nthreads blocks=nblocks kernel(u, cuintegs)
end

Unfortunately, it does not help.

As far as I understand, when the function cudaconvert(integs[i]) is called it also calls the corresponding adaptor:

function Adapt.adapt_structure(to, integ::Integrator)
    ks = [Adapt.adapt(to, integ.ks[i]) for i=1:length(integ.ks)]
    ks = CuArray(ks)
    ks = Adapt.adapt_structure(to, ks)
    return Integrator(integ.p, ks)
end

Inside this adaptor I do the same trick with the ks array of CuArrays which is the field of each integs[i] structure. According to your explanation, here I have the same issue when I transform integ.ks to CuArray ks. However, here it is not clear how to preserve integ.ks.

Yes, that’s likely the culprit. It’s not really OK to create a CuArray in an adaptor. Adapt’s structural rules (i.e. methods of the adapt_structure function) are intended to be separate from the storage-specific operations (e.g. creating a CuDeviceArray from a CuArray, or creating a CuArray from an Array). You should make it so that your Integrator already contains CuArrays before calling the kernel, and you can use Adapt to do so (e.g. have that adapt_structure method call adapt(to, ks) instead of the CuArray constructor, and then calling adapt(CuArray, integ) in your application). Launching the kernel will then take care of the final conversion to CuDeviceArray.