Kernel optimization and shared memory

Hi,
I'm trying to optimize the following kernel:

    using CUDA

    function kernel(A, B)
        index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        stride = blockDim().x * gridDim().x
        for i in index:stride:size(A, 1)
            # Float32 accumulators keep the kernel type-stable
            α = 0f0
            β = 0f0
            for j in 1:100
                for k in 1:size(A, 2)
                    top = A[i, k]
                    bot = B[i, k]
                    α += top / bot
                    β -= top / (bot * bot)
                end
            end
        end
        return
    end

I tried to use shared memory, resulting in the following kernel:

    function skernel(A, B, nthreads)
        index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        stride = blockDim().x * gridDim().x
        off = size(A, 2)
        # shared = @cuDynamicSharedMem(Float32, size(A,2)*nthreads)
        # p = view(shared, off*(threadIdx().x-1)+1:off*(threadIdx().x))
        # q = view(shared, off*(threadIdx().x-1)+1:off*(threadIdx().x))
        # per-thread staging buffers carved out of dynamic shared memory
        # (the offset argument of @cuDynamicSharedMem is in bytes)
        p = @cuDynamicSharedMem(Float32, size(A, 2), 32 * 2 * off * (threadIdx().x - 1))
        q = @cuDynamicSharedMem(Float32, size(A, 2), 32 * off * (2 * (threadIdx().x - 1) + 1))
        for i in index:stride:size(A, 1)
            for k in 1:size(A, 2)
                p[k] = A[i, k]
                q[k] = B[i, k]
            end
            α = 0f0
            β = 0f0
            for j in 1:100
                for k in 1:size(A, 2)
                    α += p[k] / q[k]
                    β -= p[k] / (q[k] * q[k])
                end
            end
        end
        return
    end

Using shared memory like that results in poor performance (10 times slower for my application). I time both kernels with this function:

    using BenchmarkTools

    function test_kernels(L, n, threads)
        numblocks = ceil(Int, L / 256)
        snumblocks = ceil(Int, L / threads)
        A = CUDA.rand(L, n)
        B = CUDA.rand(L, n)
        @btime CUDA.@sync begin
            @cuda threads=256 blocks=$numblocks kernel($A, $B)
        end
        @btime CUDA.@sync begin
            @cuda threads=$threads blocks=$snumblocks shmem=32*2*$n*$threads skernel($A, $B, $threads)
        end
    end

My main use case is L=32*1024 and n=81; due to memory limitations, the second kernel can only use up to 8 threads per block.
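(For reference: the launch requests shmem = 32*2*n*threads bytes of dynamic shared memory, i.e. 32*2*81 = 5184 bytes per thread, so a typical 48 KB per-block shared memory limit is exhausted after fewer than ten threads.)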

I wonder if this is a poor/useless use of shared memory?
(Using view instead of offset is a little bit faster, but still slower than using global memory, even though the data is reused 100 times in the loop.)

I don’t have time for an in-depth look, but 8 threads is probably way too few to use the GPU effectively. With only part of a single warp in flight, the GPU can’t hide the latency of any operation. You’ll have to reduce the amount of shared memory you use (can you tile the computation, or decompose it in another way?). Furthermore, IIUC you’re just using shared memory as a private buffer for each thread, and not to send data over to other threads. Typically you also want to reuse that data in other threads, to reduce global memory accesses, if possible. A rough sketch of such a decomposition follows below.
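For example (an untested sketch, with the hypothetical name coop_kernel): launch one block per row, and let all threads of the block stage that row of A and B into shared memory together. The per-block shared memory footprint is then 2*n*sizeof(Float32) bytes regardless of the thread count, and every thread of the block reuses the same staged data. The block-level reduction of the per-thread partial sums is omitted, just as the results are discarded in your kernels:

    function coop_kernel(A, B)
        n = size(A, 2)
        i = blockIdx().x              # one block per row of A and B
        tid = threadIdx().x
        nthreads = blockDim().x
        # two row-sized buffers shared by the whole block (offset in bytes)
        p = @cuDynamicSharedMem(Float32, n)
        q = @cuDynamicSharedMem(Float32, n, n * sizeof(Float32))
        # cooperative load: the block's threads stride over the columns of row i
        for k in tid:nthreads:n
            p[k] = A[i, k]
            q[k] = B[i, k]
        end
        sync_threads()
        α = 0f0
        β = 0f0
        for j in 1:100
            for k in tid:nthreads:n   # each thread sums its slice of the row
                α += p[k] / q[k]
                β -= p[k] / (q[k] * q[k])
            end
        end
        # a block-level reduction of α and β across the threads would go here
        return
    end

Launched with something like @cuda threads=32 blocks=L shmem=2*n*sizeof(Float32) coop_kernel(A, B), shared memory no longer limits the number of threads per block, so a full warp (or more) can be in flight per block.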