Ah, my bad! Your shared memory kernel actually looked pretty good so I assumed you had more experience.
You might want to check out this YouTube tutorial series. It explains some of the deeper considerations in designing a CUDA kernel. It uses C++, but the syntax isn’t that dissimilar from CUDA.jl, so you’ll be able to pick up the key ideas.
I always believed that arrays are stored in global memory by default. Are the arrays loaded into the L2 cache when the kernel is called?
The video series I linked explains this much more clearly than I can, but the rough idea is that all reads from global memory into registers/shared memory go through the L2 cache. If you access the same 128-byte cache line from another thread before it has been evicted, the second access is served from L2 rather than going all the way out to global memory.
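As a rough illustration (a sketch of my own, assuming CUDA.jl; the kernel and names aren’t from your code), this is the access pattern that makes cache lines work in your favor: consecutive threads in a warp read consecutive `Float32` elements, so 32 threads touch one 128-byte line and the loads coalesce into a single transaction:

```julia
using CUDA

# Each thread copies one Float32. Threads 1–32 of a warp read
# elements 1–32, i.e. 128 contiguous bytes — a single cache line —
# so the warp's loads coalesce, and a later access to the same
# line can hit in L2 instead of going back to global memory.
function coalesced_copy!(dst, src)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(src)
        @inbounds dst[i] = src[i]
    end
    return nothing
end

src = CUDA.rand(Float32, 1024)
dst = CUDA.zeros(Float32, 1024)
@cuda threads=256 blocks=4 coalesced_copy!(dst, src)
```

A strided pattern (e.g. thread `i` reading `src[32i]`) would instead pull in a separate cache line per thread, which is why coalescing matters so much for bandwidth.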