Strange behavior of CuDeviceArrays

I am playing around with passing of custom structures containing arrays into CUDA kernels. In order to be able to access these structure’s arrays inside the kernels, I convert them to CuDeviceArrays using cudaconvert function. However, sometimes, I want to see their content. Unfortunately, Base.collect function does not work with CuDeviceArrays and I can not explicitly extract their content from the device.

I decided to write my own implementation of Base.collect. The idea is to create an empty CuArray, call a CUDA kernel where I transfer the content of CuDeviceArray to the CuArray, element by element, and, finally, collect the resulting CuArray. The code below shows my implementation.

Surprisingly, the obtained results depends on preceding history. The collect function returns the desired content of the CuDeviceArray only if previously I create another CuDeviceArray (which is later never used). If I want to collect CuDeviceArray multiple times, I have to create preventively additional set of CuDeviceArrays.

Can you please explain me what is going on?

using CUDA


function set_kernel(dest, src)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i=id:stride:length(src)
        dest[i] = src[i]
    end
    return nothing
end


function Base.collect(src::CuDeviceArray)
    dest = CUDA.zeros(size(src))

    N = length(src)
    nthreads = min(N, 256)
    nblocks = cld(N, nthreads)
    @cuda threads=nthreads blocks=nblocks set_kernel(dest, src)

    return collect(dest)
end


# xxx = cudaconvert(CUDA.zeros(1))
# yyy = cudaconvert(CUDA.zeros(1))
# zzz = cudaconvert(CUDA.zeros(1))

a = cudaconvert(CUDA.ones(10))

@show collect(a)
@show collect(a)
@show collect(a)


# xxx, yyy, zzz are commented:
# collect(a) = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# collect(a) = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# collect(a) = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# xxx is uncommented; yyy, zzz are commented:
# collect(a) = Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# collect(a) = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# collect(a) = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# xxx, yyy are uncommented; zzz is commented:
# collect(a) = Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# collect(a) = Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# collect(a) = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# xxx, yyy, zzz are uncommented:
# collect(a) = Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# collect(a) = Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# collect(a) = Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Just use the original CuArray then? CuDeviceArrays aren’t meant to exist on the CPU, for one, because they don’t keep the original buffer alive. You should just work with the CuArray that backs the CuDeviceArray, and treat the conversion to a GPU-compatible counterpart as an implementation detail.

This is an example of the use case:

using CUDA

struct Foo{T}
    x :: T
end

function kernel(y, src)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i=id:stride:length(y)
        y[i] = src.x[i]
    end
    return nothing
end

N = 10
y = CUDA.zeros(N)

a = Foo(CUDA.ones(N))
# @cuda threads=N kernel(y, a)
# KernelError: passing and using non-bitstype argument

b = Foo(cudaconvert(CUDA.ones(N)))
@cuda threads=N kernel(y, b)
@show y
# y = Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

With manually converted CuDeviceArrays it is possible to pass custom structures into kernels. Please tell me if there are other methods to do the same.
If later in the code I wil want to access x field of Foo, I will have a read only memory error.

Yes, use Adapt:

julia> using Adapt

julia> Adapt.adapt_structure(to, foo::Foo) = Foo(adapt(to, foo.x))

julia> @cuda threads=N kernel(y, a)
CUDA.HostKernel{typeof(kernel), Tuple{CuDeviceVector{Float32, 1}, Foo{CuDeviceVector{Float32, 1}}}}(kernel, CuContext(0x000000000198a0c0, instance 3d8c8ee076d40ff0), CuModule(Ptr{Nothing} @0x0000000005b27d90, CuContext(0x000000000198a0c0, instance 3d8c8ee076d40ff0)), CuFunction(Ptr{Nothing} @0x00000000087e61d0, CuModule(Ptr{Nothing} @0x0000000005b27d90, CuContext(0x000000000198a0c0, instance 3d8c8ee076d40ff0))))

julia> @show y
y = Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Wow! Perfect! Thank you.

Sorry, one more question.
How I can pass an array of arrays into the CUDA kernel without manual cudaconvert?
The illustrating example is the following:

using CUDA

function kernel(dest, src)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i=id:stride:length(dest)
        dest[i] = src[1][i] + src[2][i]
    end
    return nothing
end


N = 10

out = CUDA.zeros(N)

A = [CUDA.ones(N), CUDA.ones(N)]
# @cuda threads=N kernel(out, A)   # passing and using non-bitstype argument
# Ad = CuArray(A)   # CuArray only supports bits types
# @cuda threads=N kernel(out, Ad)

B = [cudaconvert(CUDA.ones(N)), cudaconvert(CUDA.ones(N))]
Bd = CuArray(B)
@cuda threads=N kernel(out, Bd)

@show out   # out = Float32[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]

I need it when I try to solve a system of differential equations on GPU using Runge-Kutta methods. In this case I have the array of arrays - vector ks containing temporary arrays of length equal to the number of the system equations.

Array element types aren’t converted, only the array itself, so you’d end up with a CuDeviceArray of CuArrays which won’t work. You need to call cudaconvert on the inner array manually right now.

OK. Thank you.