Ah, my bad! Your shared memory kernel actually looked pretty good so I assumed you had more experience.
You might want to check out this YouTube tutorial series. It explains some of the deeper considerations in designing a CUDA kernel. It uses C++, but the syntax isn’t that dissimilar from CUDA.jl, so you’ll be able to pick up the key ideas.
I always believed that arrays are stored in global memory by default. Are the arrays loaded into the L2 cache when the kernel is called?
The video series I linked explains this much more clearly than I can, but the rough idea is that all reads from global memory into registers/shared memory go through the L2 cache. If you access the same 128-byte cache line from another thread before it has been evicted, the second access is served from L2 rather than going all the way out to global memory.
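As a rough illustration (a sketch of my own, assuming CUDA.jl; the kernel and names aren’t from your code), this is the access pattern that makes cache lines work in your favor: consecutive threads in a warp read consecutive `Float32` elements, so 32 threads touch one 128-byte line and the loads coalesce into a single transaction:

```julia
using CUDA

# Each thread copies one Float32. Threads 1–32 of a warp read
# elements 1–32, i.e. 128 contiguous bytes — a single cache line —
# so the warp's loads coalesce, and a later access to the same
# line can hit in L2 instead of going back to global memory.
function coalesced_copy!(dst, src)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(src)
        @inbounds dst[i] = src[i]
    end
    return nothing
end

src = CUDA.rand(Float32, 1024)
dst = CUDA.zeros(Float32, 1024)
@cuda threads=256 blocks=4 coalesced_copy!(dst, src)
```

A strided pattern (e.g. thread `i` reading `src[32i]`) would instead pull in a separate cache line per thread, which is why coalescing matters so much for bandwidth.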