How to use `copy!` (or other array operations) inside a CUDA kernel?

I want to copy a column of a matrix Y into another matrix X inside a custom kernel, using just ONE thread.

using CUDA

X = CUDA.rand(10, 5)
Y = CUDA.rand(10, 5)

I have tried the following approaches.
- Method 1: broadcast assignment into a column (error)

function kernel_copy1(X, Y)
    X[:, 1] .= @view Y[:, 1]
    nothing
end

@cuda kernel_copy1(X, Y)

Error:

LoadError: InvalidIRError: compiling kernel kernel_copy1(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_object_id_)
Stacktrace:
[1] objectid
@ reflection.jl:291
... (rest of the long stack trace omitted)

- Method 2: copy! on column views (error)

function kernel_copy2(X, Y)
    @views copy!(X[:, 1], Y[:, 1])
    nothing
end

@cuda kernel_copy2(X, Y)

Error:

LoadError: InvalidIRError: compiling kernel kernel_copy2(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to resize!)
Stacktrace:
[1] copy!
@ abstractarray.jl:818
[2] kernel_copy2
@ In[11]:2
Reason: unsupported dynamic function invocation (call to throw_eachindex_mismatch_indices(::IndexLinear, inds...) in Base at abstractarray.jl:259)
Stacktrace:
[1] eachindex
...

- Method 3: explicit for-loop (works)

function kernel_copy3(X, Y)
    for i = 1:size(X, 1)
        X[i, 1] = Y[i, 1]
    end
    nothing
end

@cuda kernel_copy3(X, Y)

- Is there any alternative way to perform the above copy, other than writing our own for-loop?

- Does this mean that, in principle, only scalar operations are allowed inside a CUDA kernel? A similar operation is maximum: how can we use maximum(X[:, 1]) in a kernel? In CUDA/C++ we would of course write a for-loop, but I am wondering whether high-level array functions are applicable in CUDA.jl when writing a kernel.

(PS: I know that we can write X[:, 1] .= @view Y[:, 1] directly outside a kernel (which launches a kernel implicitly), but the above code snippet is just a small portion of a custom kernel.)
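For reference, the host-level version mentioned in the PS looks like this (a small illustration added here, not part of the original question; broadcasting on CuArrays launches a GPU kernel under the hood):

using CUDA

X = CUDA.rand(10, 5)
Y = CUDA.rand(10, 5)

X[:, 1] .= @view Y[:, 1]                    # host-level broadcast; launches a kernel implicitly
@assert Array(X)[:, 1] == Array(Y)[:, 1]    # copy back to the host to check the result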

1 Like

Few array operations are implemented for the CuDeviceArray type (the device counterpart of CuArray), and kernels are generally expected to perform scalar operations. I wouldn’t be against adding additional vectorized operations for CuDeviceArrays, but with how GPUs work (warps executing in lockstep, thread divergence, etc.) it’s easy to shoot yourself in the foot by assuming that array operations in kernels benefit from the GPU’s parallelism, or worse, by doing so in a thread-divergent fashion.
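To make the “scalar operations” point concrete, here is a minimal sketch of the usual fine-grained style, where each thread copies a single element of the column (the kernel name and launch configuration are illustrative, not from this thread):

using CUDA

function kernel_copy_scalar(X, Y)
    # one thread per row of the first column
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= size(X, 1)
        @inbounds X[i, 1] = Y[i, 1]
    end
    return nothing
end

X = CUDA.rand(10, 5)
Y = CUDA.rand(10, 5)
@cuda threads=16 kernel_copy_scalar(X, Y)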

3 Likes

Thanks for the clarification. Vectorized operations can still be useful for coarse-grained parallelism on the GPU. For instance, in particle swarm optimization each particle may be assigned to one thread, and the operations on a single particle (typically encoded as a vector) involve for-loops that could be written more conveniently with high-level Julia functions like the copy! and maximum above.

Of course, I understand the current design of CUDA.jl. For now, we just need to write small utility functions, like my own _copy!, for such cases.
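For illustration, such device-side helpers might look roughly like the following sketch (only the _copy! name appears above; _maximum and the exact signatures are my own):

# Small helpers intended to be called from inside a kernel on
# (views of) device arrays; plain sequential loops, no parallelism.
@inline function _copy!(dst, src)
    @inbounds for i in 1:length(dst)
        dst[i] = src[i]
    end
    return dst
end

@inline function _maximum(x)
    m = @inbounds x[1]
    @inbounds for i in 2:length(x)
        m = max(m, x[i])
    end
    return m
end

Inside a kernel one could then call, e.g., _copy!(view(X, :, 1), view(Y, :, 1)) or _maximum(view(X, :, 1)).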

Note that there is some level of broadcasting that works in kernels if you use StaticArrays, e.g.

using CUDA, StaticArrays

Y = CUDA.rand(10, 2)

function kernel(Y)
    X = @SVector zeros(10)
    Y[:,1] .= X .+ 1
    nothing
end

@cuda kernel(Y)

although I couldn’t get your exact example working. Based on:

maybe it’s possible? Perhaps @mateuszbaran can comment further; I’m curious myself too.

It’s definitely possible; Y just has to have its first dimension statically known (using either SizedArray or HybridArray). There is a small problem with taking a view of Y in the broadcast: the 1 in the indexing triggers bounds checking, which is bad on the GPU, so @inbounds is needed, but it’s currently not propagated correctly. I’ll take a closer look at it tomorrow.

1 Like

An interesting update. @mateuszbaran @marius311

It turns out that copy! works on views of a one-dimensional CuVector inside a kernel (but not on a two-dimensional matrix).

function kernel_copy_vector(x, y)
    @views copy!(x[1:3], y[2:4])
    nothing
end

x = CUDA.rand(5)
y = CUDA.rand(5)
@cuda kernel_copy_vector(x, y)

@show x y

gives

x = Float32[0.5914948, 0.4371322, 0.7592222, 0.6038771, 0.23455298]
y = Float32[0.49029452, 0.5914948, 0.4371322, 0.7592222, 0.39804643]

which is indeed correct: x[1:3] now equals y[2:4].

A small update. It’s rougher than I thought. The good news is that this works:

using StaticArrays, HybridArrays, CUDA
X = CUDA.rand(10, 5)
Y = CUDA.rand(10, 5)
XH = randn(10, 5)
YH = randn(10, 5)

function kernel_copy1(X, Y)
    XS = HybridMatrix{10,5}(X)
    YS = HybridMatrix{10,5}(Y)
    @inbounds XS[:, 1] .= view(YS, :, 1)
    nothing
end

@cuda kernel_copy1(X, Y)

That is, it works after adapting @inbounds propagation (which IMO should be done anyway) and removing size checks from the HybridMatrix constructor (which is not ideal, but maybe worth it). I don’t know why I can’t pass a HybridMatrix directly to the kernel; it’s a very simple wrapper.

This is, of course, assuming your array is very small and static broadcasting makes sense. If not, I can’t really help with adapting non-static broadcasting to the GPU.

2 Likes

CUDA.jl uses Adapt.jl to convert array wrappers when they are passed to a kernel. Since Adapt doesn’t know about HybridMatrix (the package should define Adapt rules), you’ll end up with a CuArray on the device instead of a CuDeviceArray.
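For a custom wrapper, the Adapt.jl hook looks roughly like this sketch (using a hypothetical MyWrapper type rather than HybridMatrix itself, whose static-size type parameters would also need to be carried over):

using Adapt

# A hypothetical array wrapper, for illustration only.
struct MyWrapper{T, N, A<:AbstractArray{T, N}} <: AbstractArray{T, N}
    data::A
end
Base.size(w::MyWrapper) = size(w.data)
Base.@propagate_inbounds Base.getindex(w::MyWrapper, i::Int...) = w.data[i...]

# Rebuild the wrapper around the adapted storage, so that a MyWrapper of a
# CuArray becomes a MyWrapper of a CuDeviceArray at the kernel boundary.
Adapt.adapt_structure(to, w::MyWrapper) = MyWrapper(adapt(to, w.data))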

1 Like