How to use `copy!` (or other array operations) inside a CUDA kernel?

I want to copy a column of matrix Y to another matrix X inside a custom kernel in ONE thread.

X = CUDA.rand(10, 5)
Y = CUDA.rand(10, 5)

I have tried the following approaches.
- Method 1: error

function kernel_copy1(X, Y)
    X[:, 1] .= @view Y[:, 1]
    return nothing
end

@cuda kernel_copy1(X, Y)


LoadError: InvalidIRError: compiling kernel kernel_copy1(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_object_id_)
[1] objectid
@ reflection.jl:291
...(quite long information)

- Method 2: error

function kernel_copy2(X, Y)
    @views copy!(X[:, 1], Y[:, 1])
    return nothing
end

@cuda kernel_copy2(X, Y)


LoadError: InvalidIRError: compiling kernel kernel_copy2(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to resize!)
[1] copy!
@ abstractarray.jl:818
[2] kernel_copy2
@ In[11]:2
Reason: unsupported dynamic function invocation (call to throw_eachindex_mismatch_indices(::IndexLinear, inds...) in Base at abstractarray.jl:259)
[1] eachindex

- Method 3: for loop works

function kernel_copy3(X, Y)
    for i = 1:size(X, 1)
        X[i, 1] = Y[i, 1]
    end
    return nothing
end

@cuda kernel_copy3(X, Y)
  • Is there any alternative way to perform the above copy instead of writing our own for-loop?

  • Does it mean that, in principle, only scalar operations are allowed in a CUDA kernel? For instance, a similar operation is `maximum`: how can we use `maximum(X[:, 1])` in a kernel? In CUDA/C++ we would of course write a for loop, but I am wondering whether high-level array functions are applicable when writing a kernel with CUDA.jl.

(PS: I know that we can write `X[:, 1] .= @view Y[:, 1]` directly outside a kernel (which launches a kernel implicitly), but the code above is just a small portion of a custom kernel.)


Few array operations are implemented for the CuDeviceArray type (the device counterpart of CuArray), and kernels are generally expected to perform scalar operations. I wouldn’t be against adding additional vectorized operations for CuDeviceArrays, but with how GPUs work (warps execute in lockstep, thread divergence, etc.) it’s easy to shoot yourself in the foot by assuming that array operations in kernels benefit from the GPU’s parallelism, or worse, doing so in a thread-divergent fashion.
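For the `maximum(X[:, 1])` case from the question, the scalar-loop pattern described above might look like the following minimal sketch (the kernel name and the single-thread launch are my own; not from the thread):

```julia
using CUDA

# Reduce a column to its maximum with a plain scalar loop, the pattern
# device code expects. Runs in a single thread, as in the question.
function kernel_colmax(out, X)
    m = X[1, 1]
    for i in 2:size(X, 1)
        @inbounds m = max(m, X[i, 1])
    end
    @inbounds out[1] = m  # store the result in a 1-element device array
    return nothing
end

X = CUDA.rand(10, 5)
out = CUDA.zeros(Float32, 1)
@cuda kernel_colmax(out, X)
```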


Thanks for the clarification. The vectorized operation may be useful in coarse-grained parallelism on the GPU. For instance, in particle swarm optimization, each particle may be deployed to one thread, and the operations on a single particle (typically encoded as a vector) involve some for-loops, which can be easily written with high-level Julia functions like the copy! and maximum above.

Of course, I understand the current design of CUDA.jl. For now, we just need to write small utility functions like my own `_copy!` in such cases.
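The thread does not show the actual `_copy!` definition; a hypothetical device-side helper in that spirit could be as simple as a scalar loop (assuming `dst` and `src` have matching index ranges):

```julia
# Hypothetical helper, not the poster's actual code: copy element-wise
# with a scalar loop, which compiles fine inside a CUDA kernel.
@inline function _copy!(dst, src)
    @inbounds for i in eachindex(dst)
        dst[i] = src[i]
    end
    return dst
end
```

Inside a kernel it would be called on views, e.g. `@views _copy!(X[:, 1], Y[:, 1])`.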

Note that there is some level of broadcasting that works in kernels if you use StaticArrays, e.g.

using StaticArrays, CUDA

Y = CUDA.rand(10, 2)
function kernel(Y)
    X = @SVector zeros(10)
    Y[:, 1] .= X .+ 1
    return nothing
end

@cuda kernel(Y)

although I couldn’t get exactly your example working. Based on:

maybe it’s possible? Perhaps @mateuszbaran can comment further; I’m curious myself too.

It’s definitely possible; Y just has to have its first dimension statically known (using either SizedArray or HybridArray). There is a small problem with taking a view of Y in broadcasting (that 1 in the indexing causes bounds checking, which is bad on the GPU, so @inbounds is needed, but it’s currently not propagated correctly). I’ll take a closer look at it tomorrow.


An interesting update. @mateuszbaran @marius311

`copy!` works on a one-dimensional CuVector inside a kernel (but not on a two-dimensional matrix).

function kernel_copy_vector(x, y)
    @views copy!(x[1:3], y[2:4])
    return nothing
end

x = CUDA.rand(5)
y = CUDA.rand(5)
@cuda kernel_copy_vector(x, y)

@show x y


x = Float32[0.5914948, 0.4371322, 0.7592222, 0.6038771, 0.23455298]
y = Float32[0.49029452, 0.5914948, 0.4371322, 0.7592222, 0.39804643]

which is indeed a correct result.

A small update. It’s rougher than I thought. The good news is that this works:

using StaticArrays, HybridArrays, CUDA
X = CUDA.rand(10, 5)
Y = CUDA.rand(10, 5)
XH = randn(10, 5)
YH = randn(10, 5)

function kernel_copy1(X, Y)
    XS = HybridMatrix{10,5}(X)
    YS = HybridMatrix{10,5}(Y)
    @inbounds XS[:, 1] .= view(YS, :, 1)
    return nothing
end

@cuda kernel_copy1(X, Y)

That is, it works after adapting @inbounds propagation (which IMO should be done anyway) and removing size checks from the HybridMatrix constructor (which is not ideal but maybe worth it). I don’t know why I can’t pass a HybridMatrix directly to the kernel; it’s a very simple wrapper.

This is, of course, assuming your array is very small and static broadcasting makes sense. If not, I can’t really help with adapting non-static broadcasting to GPU.


CUDA.jl uses Adapt.jl to convert wrappers. If Adapt doesn’t know about HybridMatrix, which it doesn’t (the package should define Adapt rules), you’ll end up with a CuArray on the device instead of a CuDeviceArray.
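To illustrate what such an Adapt rule looks like, here is a generic sketch for a made-up wrapper type (`MyWrapper` is a stand-in for something like HybridMatrix; its field name is an assumption, not HybridArrays’ actual layout):

```julia
using Adapt

# A minimal wrapper around some array storage.
struct MyWrapper{A}
    data::A
end

# The Adapt rule: rebuild the wrapper around the adapted storage, so that
# converting for the device swaps the inner CuArray for a CuDeviceArray
# while keeping the wrapper itself intact.
Adapt.adapt_structure(to, w::MyWrapper) = MyWrapper(adapt(to, w.data))
```

With a rule like this, CUDA.jl’s argument conversion can pass the wrapper to `@cuda` directly instead of leaving a CuArray on the device.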
