Correct usage of shared memory？

FR13ndSDP · December 24, 2023, 7:44am

Hello everyone, I tried to test the performance of shared memory of CUDA.jl with this code:

using CUDA, BenchmarkTools, Plots

# initialize
const N::Int64 = 1024
ϕ = ones(Float64, N, N)
ϕ[1, :] .= 10.0

function evolve(ϕ, ϕn, N)
    i = (blockIdx().x-1)* blockDim().x + threadIdx().x
    j = (blockIdx().y-1)* blockDim().y + threadIdx().y

    if i < 2 || i > N-1 || j < 2 || j > N-1
        return
    end

    @inbounds ϕn[i,j] = 0.25*(ϕ[i-1,j] + ϕ[i+1,j] + ϕ[i,j-1] + ϕ[i,j+1])
    return
end

function evolve_shared(ϕ, ϕn, N)
    i = (blockIdx().x-1)* blockDim().x + threadIdx().x
    j = (blockIdx().y-1)* blockDim().y + threadIdx().y
    
    if i > N || j > N
        return
    end

    s = CuStaticSharedArray(Float64, 18*18)
    local_i = threadIdx().x
    local_j = threadIdx().y

    @inbounds s[local_i * 18 + local_j+1] = ϕ[i, j]
    #top 
    if local_i == 1 && i != 1
        @inbounds s[local_j+1] = ϕ[i-1, j]
    end
    #bottom
    if local_i == 16 && i != N
        @inbounds s[17*18+local_j+1] = ϕ[i+1, j]
    end
    #left
    if local_j == 1 && j != 1
        @inbounds s[local_i*18+local_j] = ϕ[i, j-1]
    end
    #right
    if local_j == 16 && j != N
        @inbounds s[local_i*18+local_j+2] = ϕ[i, j+1]
    end

    sync_threads()

    if i < 2 || i > N-1 || j < 2 || j > N-1
        return
    end

    @inbounds ϕn[i,j] = 0.25*(s[(local_i-1)*18+local_j+1] + s[(local_i+1)*18+local_j+1] 
                             +s[local_i*18+local_j] + s[local_i*18+local_j+2])
    return
end

function run(ϕ, N)
    nthreads = (16, 16)
    nblock = (cld(N, 16), cld(N, 16))

    ϕ_d = CuArray(ϕ)
    ϕn_d = copy(ϕ_d)

    for _ ∈ 1:100
        @cuda blocks=nblock threads=nthreads evolve(ϕ_d, ϕn_d, N)
        ϕ_d, ϕn_d = ϕn_d, ϕ_d
    end

    copyto!(ϕ, ϕ_d)
end

@benchmark run(ϕ, N)
# heatmap(ϕ)

With evolve_shared , It can theoretically improve performance by reducing access to global memory, but my test results show that using shared memory is slower. What’s wrong with my code? Thanks in advance.

maleadt · January 8, 2024, 10:45am

Nothing is wrong with your code, but it looks like the device was able to satisfy all of the duplicate memory loads from your original kernel from its cache already:

vs. the manual use of shared memory:

As you can see, the total amount of memory traffic to device memory is identical.

FR13ndSDP · January 20, 2024, 10:58am

Thanks for your reply. I translated the above julia code directly into CUDA C code and found that the manual use of shared memory was faster, and the julia code (original evolve kernel) was even faster than this “optimized” CUDA C code (amazing!). So I want to know 1. Why the original julia kernel can handle duplicate memory loads while CUDA C code cannot 2. Does this mean that I don’t need to use shared memory in my case?

maleadt · January 20, 2024, 11:49am

Julia and CUDA C use two different compiler toolchains, so here it probably means that our compiler managed to optimize the code in a way that was more cache-friendly. We don’t do anything special that would make it so that you don’t need to use shared memory in general.

Palli · January 20, 2024, 2:51pm

What does that mean? That (part of) the toolchain is implemented in Julia (i.e. not accessible for CUDA C), or that there are two CUDA toolchains to choose from and CUDA C chose differently (why?), or Julia, newer one?

maleadt · January 20, 2024, 6:28pm

Julia/CUDA.jl uses LLVM’s upstream NVPTX back-end to generate PTX; NVIDIA uses their own proprietary NVVM back-end. Compilation from PTX to SASS is identical between both stacks.

Topic		Replies	Views
GPU-Kernel function for fast matrix multiplication using shared memory GPU kernel	1	1766	August 13, 2021
I don't understand why it is slower with CuStaticSharedArray New to Julia gpu , cuda , sharedarrays , cudajl	9	321	March 17, 2025
Kernel optimization and shared memory GPU	1	447	July 9, 2021
Initializing @cuStaticSharedMem array? GPU	3	1369	May 12, 2018
Trying to understand the use of shared memory on GPUs GPU	3	2301	May 25, 2021

Correct usage of shared memory？

Related topics