Custom CUDA kernel for nontrivial sum calculations

fedoroff · September 19, 2022, 1:35pm

Suppose I want to calculate the following sum: C_{nm}=\sum_i A_{ni}B_{im} (in reality, something more complicated). The corresponding code could be

function cpu_add!(C, A, B)
    @. C = 0
    for i in axes(A,1)
        for m=1:2, n=1:2
            C[n,m] += A[i,n] * B[i,m]
        end
    end
    return nothing
end


N = 100
A = ones(N, 2)
B = ones(N, 2)
C = zeros(2, 2)

cpu_add!(C, A, B)

Now I would like to accelerate this calculation using CUDA. I know how to write the corresponding kernel for a single pair of n and m:

using CUDA


function gpu_add_single(A, B, n, m)
    T = eltype(A)
    N = size(A, 1)
    nth = min(256, prevpow(2,N))
    nbl = cld(N, nth)
    shmem = sizeof(T) * nth
    psum = CUDA.zeros(T, nbl)   # partial sum
    @sync @cuda threads=nth blocks=nbl shmem=shmem kernel_single(psum, A, B, n, m)
    return sum(psum)
end


function kernel_single(psum, A, B, n, m)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x

    Ctmp = zero(eltype(psum))
    for i=id:stride:size(A,1)
        Ctmp += A[i,n] * B[i,m]
    end

    block_reduction!(psum, Ctmp)
    return nothing
end


function block_reduction!(psum, temp)
    shmem = @cuDynamicSharedMem(eltype(psum), blockDim().x)
    ishmem = threadIdx().x
    shmem[ishmem] = temp
    sync_threads()

    # for reductions, nthreads must be a power of 2:
    i = div(blockDim().x, 2)
    while i >= 1
        if ishmem <= i
            shmem[ishmem] += shmem[ishmem + i]
        end
        i = div(i, 2)
        sync_threads()
    end

    if ishmem == 1
        psum[blockIdx().x] = shmem[1]
    end
    return nothing
end


N = 100
A = CUDA.ones(N, 2)
B = CUDA.ones(N, 2)

C12 = gpu_add_single(A, B, 1,2)

However, I do not understand how to write the kernel which do the summation for all n and m at once, like in the CPU code above. Here, instead of the 1D array for partial sum psum, I should have a higher dimensional array. But I do not get how to properly work with it, e.g. how to work with the corresponding shared memory and how to make the proper block reduction. Can you please help me with it?

maleadt · September 20, 2022, 6:49am

We don’t really have reduction primitives for use in kernels, only mapreducedim! from the host is supported. That said, you can try and re-use CUDA.reduce_block, e.g., look at the example in Training/2-2-kernel_analysis_optimization.ipynb at master · JuliaComputing/Training · GitHub

fedoroff · September 20, 2022, 11:11am

Thank you. I will take a look.

Topic		Replies	Views
Create a simple CUDA.sum kernel GPU	3	1938	January 3, 2021
CUDA sum kernels, threads and blocks, complex values Performance gpu , cuda	2	1396	February 3, 2021
Simple CUDA sum kernel General Usage	0	327	January 19, 2021
Simple kernel not working GPU	10	1175	July 12, 2020
Cumulative sum on GPUArray using KernelAbstractions GPU gpu , gpuarrays , kernelabstractions	4	186	December 24, 2024

Custom CUDA kernel for nontrivial sum calculations

Related topics