Faster read only memory

maleadt · December 19, 2019, 7:15am

The compiler doesn’t matter here; it’s your code that’s slow. We don’t do the kind of optimizations you seem to expect (changing or optimizing memory access patterns).

To optimize this, you’ll need to look into the GPU’s architectural details and avoid costly operations. For example, you are hitting global memory all the time, both loading from and writing to it through your in and out arguments. You could restructure your kernel to buffer this data in local memory (for example, shared memory). You are also accessing global memory in a random pattern, which defeats memory coalescing. Again, if you buffer locally and read/write to global memory in consecutive chunks, separate memory transactions might get merged. See for example the topic “CuArray is Row Major or Column Major?”
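To make the coalescing point concrete, here is a toy CPU-side model (not GPU code, and not your kernel) that counts how many memory segments a single warp touches for different access patterns. The warp size of 32, the 128-byte segment size, the 4-byte element size, and the helper name segments_touched are all assumptions for illustration; real hardware differs in the details, so treat the numbers as a rough intuition and check with a profiler.

```python
# Toy model: count the distinct memory segments one warp touches.
# Assumptions (not from the original post): 32 threads per warp,
# 128-byte segments, 4-byte elements, array base aligned to a segment.
import random

WARP_SIZE = 32
SEGMENT_BYTES = 128
ELEMENT_BYTES = 4


def segments_touched(indices):
    """Return how many distinct segments the given element indices fall into."""
    return len({(i * ELEMENT_BYTES) // SEGMENT_BYTES for i in indices})


# Thread t reads element t: consecutive addresses.
consecutive = list(range(WARP_SIZE))

# Thread t reads element t * 32: every access lands in its own segment.
strided = [t * 32 for t in range(WARP_SIZE)]

# Thread t reads a random element of a 1M-element array (fixed seed).
rng = random.Random(0)
scattered = [rng.randrange(1 << 20) for _ in range(WARP_SIZE)]

print("consecutive:", segments_touched(consecutive))
print("strided:    ", segments_touched(strided))
print("scattered:  ", segments_touched(scattered))
```

In this model the consecutive pattern fits in a single segment, while the strided pattern needs one segment per thread, and the random pattern will usually be close to that worst case. Staging data in a local buffer lets the global reads and writes follow the consecutive pattern even when the computation itself needs random access.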

I suggest you look into optimizing bandwidth-bound CUDA kernels (which yours seems to be) and get familiar with the necessary tools (notably nvprof/nsight-compute).

I’m not sure what you mean by this. Does an equivalent CUDA C kernel perform better? That would be the only fair comparison.

And FWIW, please try to post runnable snippets. It’s much harder to help if we’re just brainstorming based on pseudocode.
