Does shared memory really accelerate a GPU kernel?

Hello.
I’ve been learning about shared memory from here:
https://pde-on-gpu.vaw.ethz.ch/lecture10/.

I’m going through the motions to understand it, but when I benchmark the kernel I wrote based on the lecture, performance goes down rather than up.

Here is my code:

using CUDA
using BenchmarkTools
nx = ny = 2^12
T    = CUDA.rand(Float64, nx, ny);
T2   = CUDA.rand(Float64, nx, ny);
Ci   = CUDA.rand(Float64, nx, ny);
lam = _dx = _dy = dt = rand();
threads = (32,8)
blocks = (nx÷threads[1], ny÷threads[2])

function update_temperature!(T2, T, Ci, lam, dt, _dx, _dy)
        ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
        iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
        if (ix>1 && ix<size(T2,1) && iy>1 && iy<size(T2,2))
            @inbounds T2[ix,iy] = T[ix,iy] + dt*Ci[ix,iy]*(
                                  - ((-lam*(T[ix+1,iy] - T[ix,iy])*_dx) - (-lam*(T[ix,iy] - T[ix-1,iy])*_dx))*_dx
                                  - ((-lam*(T[ix,iy+1] - T[ix,iy])*_dy) - (-lam*(T[ix,iy] - T[ix,iy-1])*_dy))*_dy
                                  )
        end
        return nothing
end

function update_temperature_sha!(T2, T, Ci, lam, dt, _dx, _dy)
        ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
        iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
        tx = threadIdx().x+1
        ty = threadIdx().y+1
        T_l = @cuDynamicSharedMem(eltype(T), (blockDim().x+2, blockDim().y+2))
        @inbounds T_l[tx,ty] = T[ix,iy]
        if (ix>1 && ix<size(T2,1) && iy>1 && iy<size(T2,2))
            @inbounds if (threadIdx().x == 1)            T_l[tx-1,ty] = T[ix-1,iy] end 
            @inbounds if (threadIdx().x == blockDim().x) T_l[tx+1,ty] = T[ix+1,iy] end
            @inbounds if (threadIdx().y == 1)            T_l[tx,ty-1] = T[ix,iy-1] end
            @inbounds if (threadIdx().y == blockDim().y) T_l[tx,ty+1] = T[ix,iy+1] end
            sync_threads()
            @inbounds T2[ix,iy] = T_l[tx,ty] + dt*Ci[ix,iy]*(
                - ((-lam*(T_l[tx+1,ty] - T_l[tx,ty])*_dx) - (-lam*(T_l[tx,ty] - T_l[tx-1,ty])*_dx))*_dx
                - ((-lam*(T_l[tx,ty+1] - T_l[tx,ty])*_dy) - (-lam*(T_l[tx,ty] - T_l[tx,ty-1])*_dy))*_dy
                )
        end
        return
end


t_it = @belapsed begin @cuda blocks=$blocks threads=$threads update_temperature!($T2, $T, $Ci, $lam, $dt, $_dx, $_dy); synchronize() end
T_eff = (9*1+1)*1/1e9*nx*ny*sizeof(Float64)/t_it
#T_eff = 2913.71 GB/s
t_it_s = @belapsed begin @cuda blocks=$blocks threads=$threads shmem=prod($threads.+2)*sizeof(Float64) update_temperature_sha!($T2, $T, $Ci, $lam, $dt, $_dx, $_dy); synchronize() end
T_eff_s = (9*1+1)*1/1e9*nx*ny*sizeof(Float64)/t_it_s
#T_eff_s = 2894.07 GB/s

As you can see, the kernel that doesn’t use shared memory is slightly faster than the one that does. Is there something wrong with my code or my benchmarking method?

Kind regards


Shared memory is not always going to improve performance. For one, it may lower occupancy, since it is a limited per-block resource that restricts how many blocks (and thus threads) can be resident on an SM at once. But also, you seem to be using it here simply to cache accesses to read-only arrays. Modern GPUs are much better at automatically caching such reads, which may explain why shared memory doesn’t help here. It is still very relevant as a communication mechanism between threads, e.g. to implement a reduction.
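To illustrate that last point, here is a minimal sketch of shared memory used for communication between threads: a block-level sum reduction. This is not from the lecture; the names block_sum! and partial are just for illustration, and the block size is assumed to be a power of two.

using CUDA

function block_sum!(partial, a)
    tid = threadIdx().x
    i   = (blockIdx().x-1) * blockDim().x + tid
    buf = @cuDynamicSharedMem(eltype(a), blockDim().x)
    # each thread stages one element in shared memory (zero-padding past the end)
    @inbounds buf[tid] = i <= length(a) ? a[i] : zero(eltype(a))
    sync_threads()
    # tree reduction: threads repeatedly combine pairs of values via shared memory
    s = blockDim().x ÷ 2
    while s > 0
        if tid <= s
            @inbounds buf[tid] += buf[tid+s]
        end
        sync_threads()
        s ÷= 2
    end
    # the first thread of each block writes out the block's partial sum
    if tid == 1
        @inbounds partial[blockIdx().x] = buf[1]
    end
    return nothing
end

a       = CUDA.rand(Float64, 2^20)
threads = 256
blocks  = cld(length(a), threads)
partial = CUDA.zeros(Float64, blocks)
@cuda blocks=blocks threads=threads shmem=threads*sizeof(Float64) block_sum!(partial, a)
sum(partial) ≈ sum(a)   # finish the reduction of the per-block partial sums

Here the threads genuinely need to see each other’s intermediate values between steps, which the L1/L2 caches cannot provide; that is where shared memory (together with sync_threads) pays off.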

If you want to be sure, run these two kernels under Nsight Compute, which can show you exactly how memory is accessed and cached.
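For reference, one way to do that is to run the whole script under ncu from the shell and cap how many kernel launches get profiled. This is only a sketch: heat.jl is a placeholder name for a file containing the code above, and the exact flags can vary between Nsight Compute versions (see ncu --help):

ncu --target-processes all --launch-count 4 -o heat_report julia --project heat.jl

Opening the resulting heat_report.ncu-rep in the Nsight Compute GUI and comparing the Memory Workload Analysis section of the two kernels shows how many of the stencil reads are already served from the L1/L2 caches.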
