Does shared memory really accelerate GPU kernels?

Shared memory does not always improve performance. For one, it can lower occupancy: it is a limited per-SM resource, so the more of it each block uses, the fewer blocks (and therefore threads) can be resident at once. Also, you seem to be using it here simply to cache accesses to read-only arrays. Modern GPUs are much better at caching such reads automatically, which may explain why shared memory doesn’t help here. It is still very relevant as a communication mechanism between threads, e.g., to implement a reduction.
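To make the "communication between threads" case concrete, here is a minimal sketch of a block-level sum reduction. It is not taken from your code: the names `block_sum`, `sum_on_gpu`, and `kBlockSize` are made up for illustration, the block size of 256 is an assumption, and error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumed block size; the halving loop below requires a power of two.
constexpr int kBlockSize = 256;

// Each block sums up to kBlockSize input elements and writes one partial sum.
// Shared memory is used here so threads can exchange intermediate results,
// not merely to cache read-only input.
__global__ void block_sum(const float* __restrict__ in, float* partial, int n) {
    __shared__ float tile[kBlockSize];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Every thread stores one element (or 0 past the end of the array).
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: each step halves the number of active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            tile[tid] += tile[tid + stride];
        }
        __syncthreads();  // reached by all threads, outside the branch
    }

    if (tid == 0) {
        partial[blockIdx.x] = tile[0];
    }
}

// Hypothetical host helper: runs the kernel and adds the per-block partial
// sums on the CPU. CUDA error checking is omitted to keep the sketch short.
float sum_on_gpu(const std::vector<float>& host_in) {
    int n = static_cast<int>(host_in.size());
    if (n == 0) return 0.0f;
    int blocks = (n + kBlockSize - 1) / kBlockSize;

    float* d_in = nullptr;
    float* d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, kBlockSize>>>(d_in, d_partial, n);

    // This blocking copy on the default stream also waits for the kernel.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

The point of the sketch is that threads read values written by other threads in the same block, which caches cannot do for you. A production reduction would usually go further (warp shuffles, several elements per thread), so treat this only as an illustration of the pattern.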

If you want to be sure, run these two kernels under Nsight Compute, which can show you accurately how memory is accessed and cached.
