Strange behavior of CuDeviceArrays

fedoroff · June 8, 2021, 9:30am

I am playing around with passing of custom structures containing arrays into CUDA kernels. In order to be able to access these structure’s arrays inside the kernels, I convert them to CuDeviceArrays using cudaconvert function. However, sometimes, I want to see their content. Unfortunately, Base.collect function does not work with CuDeviceArrays and I can not explicitly extract their content from the device.

I decided to write my own implementation of Base.collect. The idea is to create an empty CuArray, call a CUDA kernel where I transfer the content of CuDeviceArray to the CuArray, element by element, and, finally, collect the resulting CuArray. The code below shows my implementation.

Surprisingly, the obtained results depends on preceding history. The collect function returns the desired content of the CuDeviceArray only if previously I create another CuDeviceArray (which is later never used). If I want to collect CuDeviceArray multiple times, I have to create preventively additional set of CuDeviceArrays.

Can you please explain me what is going on?

using CUDA


function set_kernel(dest, src)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i=id:stride:length(src)
        dest[i] = src[i]
    end
    return nothing
end


function Base.collect(src::CuDeviceArray)
    dest = CUDA.zeros(size(src))

    N = length(src)
    nthreads = min(N, 256)
    nblocks = cld(N, nthreads)
    @cuda threads=nthreads blocks=nblocks set_kernel(dest, src)

    return collect(dest)
end


# xxx = cudaconvert(CUDA.zeros(1))
# yyy = cudaconvert(CUDA.zeros(1))
# zzz = cudaconvert(CUDA.zeros(1))

a = cudaconvert(CUDA.ones(10))

@show collect(a)
@show collect(a)
@show collect(a)


# xxx, yyy, zzz are commented:
# collect(a) = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# collect(a) = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# collect(a) = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# xxx is uncommented; yyy, zzz are commented:
# collect(a) = Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# collect(a) = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# collect(a) = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# xxx, yyy are uncommented; zzz is commented:
# collect(a) = Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# collect(a) = Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# collect(a) = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# xxx, yyy, zzz are uncommented:
# collect(a) = Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# collect(a) = Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# collect(a) = Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

maleadt · June 8, 2021, 10:06am

Just use the original CuArray then? CuDeviceArrays aren’t meant to exist on the CPU, for one, because they don’t keep the original buffer alive. You should just work with the CuArray that backs the CuDeviceArray, and treat the conversion to a GPU-compatible counterpart as an implementation detail.

fedoroff · June 8, 2021, 12:36pm

This is an example of the use case:

using CUDA

struct Foo{T}
    x :: T
end

function kernel(y, src)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i=id:stride:length(y)
        y[i] = src.x[i]
    end
    return nothing
end

N = 10
y = CUDA.zeros(N)

a = Foo(CUDA.ones(N))
# @cuda threads=N kernel(y, a)
# KernelError: passing and using non-bitstype argument

b = Foo(cudaconvert(CUDA.ones(N)))
@cuda threads=N kernel(y, b)
@show y
# y = Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

With manually converted CuDeviceArrays it is possible to pass custom structures into kernels. Please tell me if there are other methods to do the same.
If later in the code I wil want to access x field of Foo, I will have a read only memory error.

maleadt · June 8, 2021, 1:01pm

Yes, use Adapt:

julia> using Adapt

julia> Adapt.adapt_structure(to, foo::Foo) = Foo(adapt(to, foo.x))

julia> @cuda threads=N kernel(y, a)
CUDA.HostKernel{typeof(kernel), Tuple{CuDeviceVector{Float32, 1}, Foo{CuDeviceVector{Float32, 1}}}}(kernel, CuContext(0x000000000198a0c0, instance 3d8c8ee076d40ff0), CuModule(Ptr{Nothing} @0x0000000005b27d90, CuContext(0x000000000198a0c0, instance 3d8c8ee076d40ff0)), CuFunction(Ptr{Nothing} @0x00000000087e61d0, CuModule(Ptr{Nothing} @0x0000000005b27d90, CuContext(0x000000000198a0c0, instance 3d8c8ee076d40ff0))))

julia> @show y
y = Float32[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

fedoroff · June 8, 2021, 1:06pm

Wow! Perfect! Thank you.

fedoroff · June 9, 2021, 9:45am

Sorry, one more question.
How I can pass an array of arrays into the CUDA kernel without manual cudaconvert?
The illustrating example is the following:

using CUDA

function kernel(dest, src)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i=id:stride:length(dest)
        dest[i] = src[1][i] + src[2][i]
    end
    return nothing
end


N = 10

out = CUDA.zeros(N)

A = [CUDA.ones(N), CUDA.ones(N)]
# @cuda threads=N kernel(out, A)   # passing and using non-bitstype argument
# Ad = CuArray(A)   # CuArray only supports bits types
# @cuda threads=N kernel(out, Ad)

B = [cudaconvert(CUDA.ones(N)), cudaconvert(CUDA.ones(N))]
Bd = CuArray(B)
@cuda threads=N kernel(out, Bd)

@show out   # out = Float32[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]

I need it when I try to solve a system of differential equations on GPU using Runge-Kutta methods. In this case I have the array of arrays - vector ks containing temporary arrays of length equal to the number of the system equations.

maleadt · June 10, 2021, 11:13am

Array element types aren’t converted, only the array itself, so you’d end up with a CuDeviceArray of CuArrays which won’t work. You need to call cudaconvert on the inner array manually right now.

fedoroff · June 10, 2021, 11:22am

OK. Thank you.

Topic		Replies	Views
Arrays of arrays and arrays of structures in CUDA kernels cause random errors GPU gpu , cuda	21	2810	October 21, 2021
Store CuArrays on a mutable struct? GPU	5	1422	July 2, 2018
Passing a wrapped array to a kernel GPU	2	539	May 27, 2020
Are there any way to copy a Dict or a custom datatype which contains an Array to GPU by CUDA? GPU question	6	1719	November 17, 2020
Allocating different arrays on multiple GPUs GPU	8	1045	September 30, 2021

Strange behavior of CuDeviceArrays

Related topics