I want to copy one column of a matrix Y into another matrix X inside a custom kernel, using a single thread.
using CUDA
X = CUDA.rand(10, 5)
Y = CUDA.rand(10, 5)
I have tried the following methods.
Method 1: broadcast assignment (error)
function kernel_copy1(X, Y)
    X[:, 1] .= @view Y[:, 1]
    nothing
end
@cuda kernel_copy1(X, Y)
Error:
LoadError: InvalidIRError: compiling kernel kernel_copy1(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_object_id_)
Stacktrace:
[1] objectid
@ reflection.jl:291
... (long stacktrace omitted)
Method 2: copy! with views (error)
function kernel_copy2(X, Y)
    @views copy!(X[:, 1], Y[:, 1])
    nothing
end
@cuda kernel_copy2(X, Y)
Error:
LoadError: InvalidIRError: compiling kernel kernel_copy2(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to resize!)
Stacktrace:
[1] copy!
@ abstractarray.jl:818
[2] kernel_copy2
@ In[11]:2
Reason: unsupported dynamic function invocation (call to throw_eachindex_mismatch_indices(::IndexLinear, inds...) in Base at abstractarray.jl:259)
Stacktrace:
[1] eachindex
...
Method 3: explicit for loop (works)
function kernel_copy3(X, Y)
    for i = 1:size(X, 1)
        X[i, 1] = Y[i, 1]
    end
    nothing
end
@cuda kernel_copy3(X, Y)
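(For comparison, a variant I also considered, distributing the same copy across the threads of one block instead of looping in a single thread; the kernel name and launch configuration are my own, not from any library:)

```julia
using CUDA

# Sketch: each thread of the block copies one row of column 1.
function kernel_copy_threads(X, Y)
    i = threadIdx().x
    if i <= size(X, 1)
        @inbounds X[i, 1] = Y[i, 1]
    end
    nothing
end

X = CUDA.rand(10, 5)
Y = CUDA.rand(10, 5)
@cuda threads=size(X, 1) kernel_copy_threads(X, Y)
```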

Is there any alternative way to perform this copy, instead of writing my own for loop?

Does this mean that, in principle, only scalar operations are allowed inside a CUDA kernel? Consider a similar operation, maximum: how can we use maximum(X[:, 1]) inside a kernel? In CUDA C++ we would of course write a for loop, but I am wondering whether high-level array functions are usable in CUDA.jl when writing a kernel.
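(The best I have come up with for maximum is again a manual loop; this is my own single-thread sketch, with `out` a one-element buffer I introduce to hold the result:)

```julia
using CUDA

# Sketch: maximum of column 1 of Y, computed with a plain loop,
# since the array-level `maximum` cannot be called in device code.
function kernel_colmax(out, Y)
    m = Y[1, 1]
    for i = 2:size(Y, 1)
        @inbounds m = max(m, Y[i, 1])
    end
    out[1] = m
    nothing
end

Y = CUDA.rand(10, 5)
out = CUDA.zeros(Float32, 1)
@cuda kernel_colmax(out, Y)
```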
(PS: I know that X[:, 1] .= @view Y[:, 1] works directly outside a kernel, where it implicitly launches a kernel of its own, but the snippet above is just a small portion of a larger custom kernel.)