How to do mapslices() in parallel for 3D arrays

Update: here are my two solutions.
Solution 1: shorter, but not efficient enough. It stacks the layers of the 3D array into a single 2D array, performs one 2D matrix multiplication, and then converts the result back to a 3D array.

using CUDA

A = CUDA.rand(4, 4, 3)
B = CUDA.rand(4, 4)

C = reshape(permutedims(A, [1, 3, 2]), size(A, 1) * size(A, 3), :) * B
C = permutedims(reshape(C, size(A, 1), size(A, 3), :), [1, 3, 2])
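To make the index bookkeeping easier to follow, here is a minimal CPU-only sketch of the same reshape/permutedims idea, checked against a slice-by-slice product. The non-square sizes and the names `stacked` and `C_ref` are my own choices for illustration; plain `Array`s stand in for `CuArray`s, on the assumption that the same `permutedims` and `reshape` calls behave identically on the GPU.

```julia
# CPU arrays used as stand-ins for CuArrays; sizes chosen so that
# mixing up dimensions would produce an error rather than pass silently.
A = rand(4, 5, 3)   # (m, n, k)
B = rand(5, 6)      # (n, r)

m, n, k = size(A)

# Stack the k layers along the rows: (m, n, k) -> (m, k, n) -> (m*k, n)
stacked = reshape(permutedims(A, (1, 3, 2)), m * k, n)

# One large multiplication, then unstack: (m*k, r) -> (m, k, r) -> (m, r, k)
C = permutedims(reshape(stacked * B, m, k, :), (1, 3, 2))

# Reference result: multiply each layer by B separately
C_ref = cat((A[:, :, s] * B for s in 1:k)...; dims = 3)

println(C ≈ C_ref)
```

Note that both `permutedims` calls materialize a copy of the data, which is probably part of why this version is not efficient enough for large arrays.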

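Solution 2: longer, but more efficient. A custom kernel computes each entry of the final 3D array with a for loop over the shared dimension. Because each thread runs that loop sequentially, this solution requires that the shared (inner) dimension of the two arrays is not too large (i.e. for A of size (m, n, k) and B of size (n, r), n is not very large).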

using CUDA

A = CUDA.rand(4, 4, 3)
B = CUDA.rand(4, 4)
C = CUDA.zeros(4, 4, 3)

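function kernel!(C, A, B)
    # One thread per output entry C[i, j, k]
    i = threadIdx().x
    j = threadIdx().y
    k = threadIdx().z

    if (i <= size(C, 1) && j <= size(C, 2) && k <= size(C, 3))
        # Loop over the shared dimension n = size(A, 2) = size(B, 1);
        # C must be zero-initialized because the result is accumulated into it
        for ii in 1:size(A, 2)
            @inbounds C[i, j, k] += A[i, ii, k] * B[ii, j]
        end
    end

    return nothing
end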
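The kernel above indexes with `threadIdx()` only, so it has to be launched inside a single block, e.g. `@cuda threads=size(C) kernel!(C, A, B)`, and the whole output must fit within the per-block thread limit (1024 threads on current NVIDIA GPUs). For larger arrays, a sketch along the following lines should work. The name `batched_kernel!`, the block shape `(8, 8, 4)`, and the array sizes are my own assumptions for illustration, and the code needs a CUDA-capable GPU to run.

```julia
using CUDA

# Hypothetical variant of the kernel above that also uses the block index,
# so the output is no longer limited to what fits in one block.
function batched_kernel!(C, A, B)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    k = (blockIdx().z - 1) * blockDim().z + threadIdx().z

    if i <= size(C, 1) && j <= size(C, 2) && k <= size(C, 3)
        # Accumulate in a local variable and write to global memory once
        acc = zero(eltype(C))
        for ii in 1:size(A, 2)          # shared dimension n
            @inbounds acc += A[i, ii, k] * B[ii, j]
        end
        @inbounds C[i, j, k] = acc
    end

    return nothing
end

A = CUDA.rand(4, 5, 3)    # (m, n, k)
B = CUDA.rand(5, 6)       # (n, r)
C = CUDA.zeros(4, 6, 3)   # (m, r, k)

threads = (8, 8, 4)                 # 256 threads per block (assumed shape)
blocks = cld.(size(C), threads)     # enough blocks to cover all of C
@cuda threads=threads blocks=blocks batched_kernel!(C, A, B)

# Compare against a slice-by-slice product on the CPU
Ah, Bh, Ch = Array(A), Array(B), Array(C)
println(all(Ch[:, :, s] ≈ Ah[:, :, s] * Bh for s in 1:size(Ah, 3)))
```

Accumulating into `acc` instead of `C[i, j, k]` avoids repeated global-memory writes and means `C` does not have to be zeroed beforehand; whether that matters in practice depends on the sizes involved.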