Kernel optimization and shared memory

Hi,
I'm trying to optimize the following kernel:

    using CUDA

    function kernel(A, B)
        index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        stride = blockDim().x * gridDim().x
        for i in index:stride:size(A, 1)
            # Float32 accumulators keep the kernel type-stable
            α = 0f0
            β = 0f0
            for j in 1:100
                for k in 1:size(A, 2)
                    top = A[i, k]
                    bot = B[i, k]
                    α += top / bot
                    β -= top / (bot * bot)
                end
            end
        end
        return
    end

I tried to use shared memory, resulting in the following kernel:

    function skernel(A, B, nthreads)
        index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        stride = blockDim().x * gridDim().x
        off = size(A, 2)
        # shared = @cuDynamicSharedMem(Float32, size(A,2)*nthreads)
        # p = view(shared, off*(threadIdx().x-1)+1:off*(threadIdx().x))
        # q = view(shared, off*(threadIdx().x-1)+1:off*(threadIdx().x))
        # per-thread staging buffers carved out of dynamic shared memory
        # (the offset argument of @cuDynamicSharedMem is in bytes)
        p = @cuDynamicSharedMem(Float32, size(A, 2), 32 * 2 * off * (threadIdx().x - 1))
        q = @cuDynamicSharedMem(Float32, size(A, 2), 32 * off * (2 * (threadIdx().x - 1) + 1))
        for i in index:stride:size(A, 1)
            for k in 1:size(A, 2)
                p[k] = A[i, k]
                q[k] = B[i, k]
            end
            α = 0f0
            β = 0f0
            for j in 1:100
                for k in 1:size(A, 2)
                    α += p[k] / q[k]
                    β -= p[k] / (q[k] * q[k])
                end
            end
        end
        return
    end

Using shared memory like that results in poor performance (10 times slower for my application). I time both kernels with this function:

    using BenchmarkTools

    function test_kernels(L, n, threads)
        numblocks = ceil(Int, L / 256)
        snumblocks = ceil(Int, L / threads)
        A = CUDA.rand(L, n)
        B = CUDA.rand(L, n)
        @btime CUDA.@sync begin
            @cuda threads=256 blocks=$numblocks kernel($A, $B)
        end
        @btime CUDA.@sync begin
            @cuda threads=$threads blocks=$snumblocks shmem=32*2*$n*$threads skernel($A, $B, $threads)
        end
    end

My main use case is L=32*1024 and n=81; due to memory limitations, the second kernel can only use up to 8 threads per block.
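(For reference: the launch requests shmem = 32*2*n*threads bytes of dynamic shared memory, i.e. 32*2*81 = 5184 bytes per thread, so a typical 48 KB per-block shared memory limit is exhausted after fewer than ten threads.)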

I wonder if this is a poor/useless use of shared memory?
(Using view instead of offset is a little bit faster, but still slower than using global memory, even though the data is reused 100 times in the loop.)

I don’t have time for an in-depth look, but 8 threads is probably way too few to use the GPU effectively. With only part of a single warp in flight, the GPU can’t hide the latency of any operation. You’ll have to reduce the amount of shared memory you use (can you tile the computation, or decompose it in another way?). Furthermore, IIUC you’re just using shared memory as a private buffer for each thread, and not to send data over to other threads. Typically you also want to reuse that data in other threads, to reduce global memory accesses, if possible. A rough sketch of such a decomposition follows below.
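For example (an untested sketch, with the hypothetical name coop_kernel): launch one block per row, and let all threads of the block stage that row of A and B into shared memory together. The per-block shared memory footprint is then 2*n*sizeof(Float32) bytes regardless of the thread count, and every thread of the block reuses the same staged data. The block-level reduction of the per-thread partial sums is omitted, just as the results are discarded in your kernels:

    function coop_kernel(A, B)
        n = size(A, 2)
        i = blockIdx().x              # one block per row of A and B
        tid = threadIdx().x
        nthreads = blockDim().x
        # two row-sized buffers shared by the whole block (offset in bytes)
        p = @cuDynamicSharedMem(Float32, n)
        q = @cuDynamicSharedMem(Float32, n, n * sizeof(Float32))
        # cooperative load: the block's threads stride over the columns of row i
        for k in tid:nthreads:n
            p[k] = A[i, k]
            q[k] = B[i, k]
        end
        sync_threads()
        α = 0f0
        β = 0f0
        for j in 1:100
            for k in tid:nthreads:n   # each thread sums its slice of the row
                α += p[k] / q[k]
                β -= p[k] / (q[k] * q[k])
            end
        end
        # a block-level reduction of α and β across the threads would go here
        return
    end

Launched with something like @cuda threads=32 blocks=L shmem=2*n*sizeof(Float32) coop_kernel(A, B), shared memory no longer limits the number of threads per block, so a full warp (or more) can be in flight per block.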