I am trying to reimplement `mapslices()` with a CUDA kernel. This function takes each layer (2D slice) of a 3D array and multiplies it by the same matrix, like below:
```julia
A = rand(4, 4, 3)
B = rand(4, 4)
C = similar(A)
C = mapslices(x -> x * B, A, dims = [1, 2])
```
But I ran into trouble when I tried to write the kernel:
```julia
using CUDA

A = CUDA.rand(4, 4, 3)
B = CUDA.rand(4, 4)
C = similar(A)

function kernel!(C, A, B)
    i = threadIdx().x
    if i <= size(A, 3)
        @inbounds C[:, :, i] = A[:, :, i] * B
    end
    return nothing
end

@cuda threads = size(A, 3) kernel!(C, A, B)
```
When I ran this kernel, the Julia REPL crashed without printing any result. I also tried the kernel with lower-dimensional arrays, and it threw errors. I suspect that accessing a whole column, row, or layer inside a kernel is not allowed, as opposed to accessing a single element.
Am I right? If so, is there another way to execute `mapslices()` from my example in parallel? I have considered CPU multithreading, but it seems to run serially, not in parallel.
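For reference, here is the direction I was considering instead: a per-element variant where each thread computes a single output entry with an explicit dot-product loop, so no array slices (and no allocations) happen inside the kernel. This is only a sketch; the thread/block index mapping and launch configuration are my own guesses, and I have not verified it is the idiomatic approach:

```julia
using CUDA

# Each thread owns one (i, j, k) entry of C and accumulates
# C[i, j, k] = sum over m of A[i, m, k] * B[m, j],
# i.e. layer k of C is (layer k of A) * B, without forming slices.
function kernel_elem!(C, A, B)
    i = threadIdx().x   # row of the output layer
    j = threadIdx().y   # column of the output layer
    k = blockIdx().x    # which layer (slice along dims 3)
    if i <= size(C, 1) && j <= size(C, 2) && k <= size(C, 3)
        acc = zero(eltype(C))
        for m in 1:size(A, 2)
            @inbounds acc += A[i, m, k] * B[m, j]
        end
        @inbounds C[i, j, k] = acc
    end
    return nothing
end

A = CUDA.rand(4, 4, 3)
B = CUDA.rand(4, 4)
C = similar(A)

# One block per layer, one thread per output element of that layer.
@cuda threads = (size(A, 1), size(A, 2)) blocks = size(A, 3) kernel_elem!(C, A, B)
```

Is something like this the right way to go, or is there a higher-level batched-multiplication routine I should be using instead?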