Custom CUDA kernel for nontrivial sum calculations

Suppose I want to calculate the following sum: C_{nm}=\sum_i A_{ni}B_{im} (in reality, something more complicated). The corresponding code could be

function cpu_add!(C, A, B)
    @. C = 0
    for i in axes(A,1)
        for m=1:2, n=1:2
            C[n,m] += A[i,n] * B[i,m]
        end
    end
    return nothing
end

N = 100
A = ones(N, 2)
B = ones(N, 2)
C = zeros(2, 2)

cpu_add!(C, A, B)
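As a quick sanity check: the sum C_{nm}=\sum_i A_{in}B_{im} is exactly the matrix product AᵀB, and with all-ones inputs every entry should equal N = 100. A small self-contained snippet (cpu_add_check! just repeats the loop above so this runs on its own):

```julia
using Test

# Same reduction as cpu_add! above, repeated so this snippet is standalone.
function cpu_add_check!(C, A, B)
    C .= 0
    for i in axes(A, 1), m in 1:2, n in 1:2
        C[n, m] += A[i, n] * B[i, m]
    end
    return nothing
end

N = 100
A = ones(N, 2); B = ones(N, 2); C = zeros(2, 2)
cpu_add_check!(C, A, B)

# C_{nm} = \sum_i A_{in} B_{im} is (AᵀB)_{nm}; with ones, every entry is N
@test C ≈ transpose(A) * B
@test all(C .== N)
```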

Now I would like to accelerate this calculation using CUDA. I know how to write the corresponding kernel for a single pair of n and m:

using CUDA

function gpu_add_single(A, B, n, m)
    T = eltype(A)
    N = size(A, 1)
    nth = min(256, prevpow(2,N))  # power of 2, required by the block reduction
    nbl = cld(N, nth)
    shmem = sizeof(T) * nth
    psum = CUDA.zeros(T, nbl)   # partial sums, one entry per block
    # CUDA.@sync (not Base's @sync) waits for the kernel to finish:
    CUDA.@sync @cuda threads=nth blocks=nbl shmem=shmem kernel_single(psum, A, B, n, m)
    return sum(psum)
end

function kernel_single(psum, A, B, n, m)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x

    Ctmp = zero(eltype(psum))
    for i=id:stride:size(A,1)   # grid-stride loop over the reduction index
        Ctmp += A[i,n] * B[i,m]
    end

    block_reduction!(psum, Ctmp)
    return nothing
end

function block_reduction!(psum, temp)
    shmem = @cuDynamicSharedMem(eltype(psum), blockDim().x)
    ishmem = threadIdx().x
    shmem[ishmem] = temp
    sync_threads()   # make every thread's value visible before reducing

    # for this tree reduction, nthreads must be a power of 2:
    i = div(blockDim().x, 2)
    while i >= 1
        if ishmem <= i
            shmem[ishmem] += shmem[ishmem + i]
        end
        sync_threads()   # wait for all partial additions at this level
        i = div(i, 2)
    end

    if ishmem == 1
        psum[blockIdx().x] = shmem[1]
    end
    return nothing
end

N = 100
A = CUDA.ones(N, 2)
B = CUDA.ones(N, 2)

C12 = gpu_add_single(A, B, 1,2)

However, I do not understand how to write a kernel that does the summation for all n and m at once, like in the CPU code above. Instead of the 1D partial-sum array psum, I should have a higher-dimensional array, but I do not see how to work with it properly, e.g. how to lay out the corresponding shared memory and how to do the block reduction. Can you please help me with it?

We don’t really have reduction primitives for use in kernels; only mapreducedim! from the host is supported. That said, you can try to re-use CUDA.reduce_block, e.g. see the example in Training/2-2-kernel_analysis_optimization.ipynb at master · JuliaComputing/Training · GitHub
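In the meantime, one untested sketch of the all-pairs version: since each (n, m) pair is an independent reduction, you can put the pair indices on the y and z grid dimensions and keep the per-block shared-memory reduction exactly as in the single-pair kernel. The names kernel_all and gpu_add, and the nbl×2×2 psum layout, are my own choices here, not an established API:

```julia
using CUDA

# One reduction per (n, m) pair: the pair indices ride on the y/z grid
# dimensions; the x dimension and the shared-memory step are as before.
function kernel_all(psum, A, B)
    n = blockIdx().y
    m = blockIdx().z
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x

    Ctmp = zero(eltype(psum))
    for i = id:stride:size(A, 1)
        Ctmp += A[i, n] * B[i, m]
    end

    # same power-of-2 tree reduction as in block_reduction!, but each block
    # writes its result to the (block, n, m) slot of the 3D partial-sum array
    shmem = @cuDynamicSharedMem(eltype(psum), blockDim().x)
    ish = threadIdx().x
    shmem[ish] = Ctmp
    sync_threads()
    s = div(blockDim().x, 2)
    while s >= 1
        if ish <= s
            shmem[ish] += shmem[ish + s]
        end
        sync_threads()
        s = div(s, 2)
    end
    if ish == 1
        psum[blockIdx().x, n, m] = shmem[1]
    end
    return nothing
end

function gpu_add(A, B)
    T = eltype(A)
    N = size(A, 1)
    nth = min(256, prevpow(2, N))   # power of 2 for the tree reduction
    nbl = cld(N, nth)
    psum = CUDA.zeros(T, nbl, size(A, 2), size(B, 2))
    @cuda threads=nth blocks=(nbl, size(A, 2), size(B, 2)) shmem=sizeof(T)*nth kernel_all(psum, A, B)
    # finish the reduction over blocks on the host; sum() synchronizes
    return dropdims(sum(psum; dims=1); dims=1)
end

C = gpu_add(CUDA.ones(100, 2), CUDA.ones(100, 2))   # every entry should be 100
```

The extra grid dimensions cost nothing when the pair count is small (2×2 here); an alternative would be to loop over (n, m) inside the kernel with a shared-memory tile per pair, but that complicates the reduction for little gain at this size.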

Thank you. I will take a look.