I don't understand why it is slower with CuStaticSharedArray

eldee · March 12, 2025, 5:16pm

Thanks. Such diagrams are always useful.

But regarding the code, could you provide it in a form we can run just by copy-pasting? I.e., something of the form

using CUDA, BenchmarkTools
(...)

const Ti = Int32
const Tf = Float64

const Δ::Tf = Tf(1e-2)  # (we don't know what your df contains)
(...)

function kernel_comp_v_noshmem!(...)
    (...)
end

function kernel_comp_v!(...)
    (...)
end

v = (...)  # e.g. CUDA.rand(...)
(...)

display(@benchmark CUDA.@sync begin
    kernel_comp_v_noshmem!($v_temp, $v, $F; threads = block_dim, blocks = grid_dim)
end)
display(@benchmark CUDA.@sync begin
    kernel_comp_v!($v_temp, $v, $F; threads = block_dim, blocks = grid_dim)
end)
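To make the requested shape concrete, here is a minimal sketch of what a fully filled-in version could look like. The kernels are hypothetical stand-ins (a 3-point average), not your comp_v kernels, and the names kernel_avg_noshmem!, kernel_avg_shmem!, launch!, BLOCK and TILE are made up for illustration. It assumes a working CUDA.jl installation and skips the GPU part otherwise.

```julia
using CUDA, BenchmarkTools

const Ti = Int32
const Tf = Float64
const BLOCK = 256         # threads per block (assumed value)
const TILE = BLOCK + 2    # shared tile size: one halo element on each side

# Hypothetical stand-in kernel: 3-point average read directly from global memory.
function kernel_avg_noshmem!(out, v)
    i = (blockIdx().x - Ti(1)) * blockDim().x + threadIdx().x
    n = length(v)
    if 1 < i < n
        @inbounds out[i] = (v[i - 1] + v[i] + v[i + 1]) / 3
    end
    return nothing
end

# Same computation, but each block first stages its tile (plus halo) in shared memory.
function kernel_avg_shmem!(out, v)
    t = threadIdx().x
    i = (blockIdx().x - Ti(1)) * blockDim().x + t
    n = length(v)
    tile = CuStaticSharedArray(Tf, TILE)
    if i <= n
        @inbounds tile[t + 1] = v[i]
        if t == 1 && i > 1
            @inbounds tile[1] = v[i - 1]        # left halo
        end
        if t == blockDim().x && i < n
            @inbounds tile[t + 2] = v[i + 1]    # right halo
        end
    end
    sync_threads()
    if 1 < i < n
        @inbounds out[i] = (tile[t] + tile[t + 1] + tile[t + 2]) / 3
    end
    return nothing
end

# Launch helper so both kernels are benchmarked with the same configuration.
function launch!(kernel, out, v)
    CUDA.@sync @cuda threads=BLOCK blocks=cld(length(v), BLOCK) kernel(out, v)
    return out
end

if CUDA.functional()
    n = 2^20
    v = CUDA.rand(Tf, n)
    out_a = CUDA.zeros(Tf, n)
    out_b = CUDA.zeros(Tf, n)

    # Check that both variants agree before timing them.
    launch!(kernel_avg_noshmem!, out_a, v)
    launch!(kernel_avg_shmem!, out_b, v)
    @assert Array(out_a) ≈ Array(out_b)

    display(@benchmark launch!(kernel_avg_noshmem!, $out_a, $v))
    display(@benchmark launch!(kernel_avg_shmem!, $out_b, $v))
end
```

With something like this, anyone can paste the snippet into a REPL, reproduce the timings on their own GPU, and modify the kernels directly. The correctness check before the benchmarks also rules out the case where the two versions are not computing the same thing.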
