Correct usage of shared memory?

Hello everyone, I tried to test the performance of shared memory in CUDA.jl with this code:

using CUDA, BenchmarkTools, Plots

# initialize
const N::Int64 = 1024
ϕ = ones(Float64, N, N)
ϕ[1, :] .= 10.0

function evolve(ϕ, ϕn, N)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y

    if i < 2 || i > N-1 || j < 2 || j > N-1
        return
    end

    @inbounds ϕn[i,j] = 0.25*(ϕ[i-1,j] + ϕ[i+1,j] + ϕ[i,j-1] + ϕ[i,j+1])
    return
end

function evolve_shared(ϕ, ϕn, N)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    
    if i > N || j > N
        return
    end

    # 18×18 tile: a 16×16 block of interior points plus a one-cell halo on each
    # side, stored with 1-based linear indexing: (row r, col c) lives at (r-1)*18 + c
    s = CuStaticSharedArray(Float64, 18*18)
    local_i = threadIdx().x
    local_j = threadIdx().y

    # interior: thread (local_i, local_j) writes tile row local_i+1, col local_j+1
    @inbounds s[local_i*18 + local_j + 1] = ϕ[i, j]
    # top halo
    if local_i == 1 && i != 1
        @inbounds s[local_j + 1] = ϕ[i-1, j]
    end
    # bottom halo (the hard-coded 16 assumes a 16×16 thread block)
    if local_i == 16 && i != N
        @inbounds s[17*18 + local_j + 1] = ϕ[i+1, j]
    end
    # left halo
    if local_j == 1 && j != 1
        @inbounds s[local_i*18 + local_j] = ϕ[i, j-1]
    end
    # right halo
    if local_j == 16 && j != N
        @inbounds s[local_i*18 + local_j + 2] = ϕ[i, j+1]
    end

    sync_threads()

    if i < 2 || i > N-1 || j < 2 || j > N-1
        return
    end

    @inbounds ϕn[i,j] = 0.25*(s[(local_i-1)*18 + local_j + 1] + s[(local_i+1)*18 + local_j + 1] +
                              s[local_i*18 + local_j] + s[local_i*18 + local_j + 2])
    return
end

function run(kernel, ϕ, N)
    nthreads = (16, 16)
    nblock = (cld(N, 16), cld(N, 16))

    ϕ_d = CuArray(ϕ)
    ϕn_d = copy(ϕ_d)

    for _ ∈ 1:100
        @cuda blocks=nblock threads=nthreads kernel(ϕ_d, ϕn_d, N)
        ϕ_d, ϕn_d = ϕn_d, ϕ_d
    end

    # copying back to the host also synchronizes, so all launches are included in the timing
    copyto!(ϕ, ϕ_d)
end

@benchmark run(evolve, ϕ, N)
@benchmark run(evolve_shared, ϕ, N)
# heatmap(ϕ)

With evolve_shared, using shared memory should in theory improve performance by reducing accesses to global memory, but my test results show that the shared-memory kernel is slower. What’s wrong with my code? Thanks in advance.

Nothing is wrong with your code, but it looks like the device was able to satisfy all of the duplicate memory loads from your original kernel from its cache already:

[profiler screenshot: memory traffic of the original evolve kernel]

vs. the manual use of shared memory:

[profiler screenshot: memory traffic of the evolve_shared kernel]

As you can see, the total amount of memory traffic to device memory is identical.
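
If you want to reproduce this kind of comparison yourself, here is a minimal sketch using CUDA.jl's timing and profiling helpers; it reuses N, evolve, and evolve_shared from the post above and assumes a reasonably recent CUDA.jl where CUDA.@profile prints per-kernel statistics (for detailed memory-traffic charts like the ones above you would run the script under NVIDIA Nsight Compute instead):

ϕ_d = CuArray(ones(Float64, N, N))
ϕn_d = copy(ϕ_d)
nthreads = (16, 16)
nblocks = (cld(N, 16), cld(N, 16))

# rough per-launch timings; the first call of each kernel includes compilation,
# so launch once to warm up before trusting these numbers
t_global = CUDA.@elapsed @cuda blocks=nblocks threads=nthreads evolve(ϕ_d, ϕn_d, N)
t_shared = CUDA.@elapsed @cuda blocks=nblocks threads=nthreads evolve_shared(ϕ_d, ϕn_d, N)
@show t_global t_shared

# kernel-level statistics (times, launch configuration) from the integrated profiler
CUDA.@profile begin
    @cuda blocks=nblocks threads=nthreads evolve(ϕ_d, ϕn_d, N)
    @cuda blocks=nblocks threads=nthreads evolve_shared(ϕ_d, ϕn_d, N)
end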


Thanks for your reply. I translated the above Julia code directly into CUDA C and found that there the manual use of shared memory was faster, and that the Julia code (the original evolve kernel) was even faster than this “optimized” CUDA C code (amazing!). So I want to know: 1. Why can the original Julia kernel handle the duplicate memory loads while the CUDA C code cannot? 2. Does this mean that I don’t need to use shared memory in my case?


Julia and CUDA C use two different compiler toolchains, so here it probably means that our compiler managed to optimize the code in a way that was more cache-friendly. We don’t do anything special that would make it so that you don’t need to use shared memory in general.
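
If you are curious what each toolchain actually produced, CUDA.jl's reflection macros let you inspect the generated code without launching anything. A minimal sketch, reusing ϕ_d, ϕn_d, and N from above:

# optimized LLVM IR, and the PTX emitted by the NVPTX back-end
@device_code_llvm @cuda launch=false evolve(ϕ_d, ϕn_d, N)
@device_code_ptx @cuda launch=false evolve(ϕ_d, ϕn_d, N)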


What does that mean? That (part of) the toolchain is implemented in Julia (i.e. not accessible from CUDA C), or that there are two CUDA toolchains to choose from and CUDA C chose differently (why?), or that Julia uses a newer one?

Julia/CUDA.jl uses LLVM’s upstream NVPTX back-end to generate PTX; NVIDIA uses their own proprietary NVVM back-end. Compilation from PTX to SASS is identical between both stacks.
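
Since the PTX → SASS step is shared between the two stacks, comparing the final machine code is the most direct way to see where they diverge. A minimal sketch for the Julia side (the CUDA C binary could be inspected with something like cuobjdump --dump-sass):

# dump the SASS generated from the Julia kernel's PTX
@device_code_sass @cuda launch=false evolve(ϕ_d, ϕn_d, N)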