How to do mapslices() in parallel for 3D arrays

Basically, I want to implement mapslices() with a GPU kernel. The call slices each layer out of a 3D array and multiplies every layer by the same matrix, like below:

A = rand(4, 4, 3)
B = rand(4, 4)
C = mapslices(x -> x * B, A, dims = [1, 2])  # each slice A[:, :, i] is multiplied by B

But when I tried to write the equivalent kernel:

using CUDA

A = CUDA.rand(4, 4, 3)
B = CUDA.rand(4, 4)
C = similar(A)

function kernel!(C, A, B)
	i = threadIdx().x

	if (i <= size(A, 3))
		@inbounds C[:, :, i] = A[:, :, i] * B
	end

	return nothing
end

@cuda threads = size(A, 3) kernel!(C, A, B)

When I ran this kernel, the Julia REPL quit unexpectedly without showing any result. I also tried the kernel with lower-dimensional arrays, but that produced errors. I suspect that accessing a whole column, row, or layer inside a kernel is not allowed, as opposed to accessing a single element.

Am I right? If so, are there other ways to execute mapslices() from my example in parallel? I’ve considered multithreading, but it seems to run serially rather than in parallel.

The crash is unexpected, and I filed an issue for it. That aside, you’re making the same “mistake” as in the other thread: GPU kernel functions should generally operate on scalars; array operations do not work there (or at least come with many restrictions). The slice-and-multiply expression, C[:, :, i] = A[:, :, i] * B, is exactly such an unsupported array operation.

That is also why mapslices, eachslice, etc. are unsupported: they take a closure that performs array operations. The only feasible option right now is to execute that closure for each slice from the host, but that results in many kernel launches and generally performs badly. The real solution here would be a tensor compiler that knows how to analyze the nested array operations and generate a single kernel, but that’s not what CUDA.jl (or any of our other Julia GPU back-ends) currently is.
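
For illustration, a minimal sketch of that host-driven fallback, using the arrays from the question (my own example, not code from the reply above):

using CUDA

A = CUDA.rand(4, 4, 3)
B = CUDA.rand(4, 4)
C = similar(A)

# same expression as in the kernel attempt, but executed from the host;
# every iteration triggers its own kernel launches
for i in 1:size(A, 3)
    C[:, :, i] = A[:, :, i] * B
end

For a handful of slices this is fine, but the launch overhead dominates as size(A, 3) grows, which is the problem described above.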


Updating with my solutions here.
Shorter but less efficient solution: stack the layers of the 3D array into one 2D array, compute a single 2D matrix multiplication, and then convert the result back to a 3D array.

using CUDA

A = CUDA.rand(4, 4, 3)
B = CUDA.rand(4, 4)

# stack the layers: (m, n, k) -> (m, k, n) -> (m*k, n), then one multiplication with B gives (m*k, r)
C = reshape(permutedims(A, [1, 3, 2]), size(A, 1) * size(A, 3), :) * B
# undo the stacking: (m*k, r) -> (m, k, r) -> (m, r, k)
C = permutedims(reshape(C, size(A, 1), size(A, 3), :), [1, 3, 2])
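
As a quick sanity check (my own addition, not part of the original post), the result matches the plain CPU mapslices() from the first post:

A_h, B_h = Array(A), Array(B)
@assert Array(C) ≈ mapslices(x -> x * B_h, A_h, dims = [1, 2])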

Longer but more efficient solution: use a loop inside the kernel to compute each entry of the final 3D array. This requires that the shared inner dimension of the two arrays does not get too large (i.e. for A of size (m, n, k) and B of size (n, r), n should stay small, since each thread loops over it serially).

using CUDA

A = CUDA.rand(4, 4, 3)
B = CUDA.rand(4, 4)
C = CUDA.zeros(4, 4, 3)

function kernel!(C, A, B)
    i = threadIdx().x
    j = threadIdx().y
    k = threadIdx().z

    if (i <= size(C, 1) && j <= size(C, 2) && k <= size(C, 3))
        # one thread per output entry: accumulate the dot product over the shared dimension n
        for ii in 1:size(A, 2)
            @inbounds C[i, j, k] += A[i, ii, k] * B[ii, j]
        end
    end

    return nothing
end
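
The snippet above ends at the kernel definition; following the launch style of the first post, it would presumably be started with one thread per output element, which assumes the whole output fits into a single block (at most 1024 threads):

@cuda threads = size(C) kernel!(C, A, B)

For larger arrays the kernel would also need to index over blockIdx() and be launched with a matching blocks = ... argument.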