Improving cache behavior with ldg()

What exactly is meant by the statement at https://juliagpu.github.io/CUDAnative.jl/stable/lib/device/array.html

that ldg “loads the value through the read-only texture cache for improved cache behavior”?

The following is a memory-bandwidth-bound kernel that spends most of its execution time waiting on reads from global memory. It performs the same regardless of whether the load is written as ldg(in,I) or in[I]:

using CuArrays, CUDAnative, BenchmarkTools
a = 10000;
b = 5000;
in = CuArrays.rand(a,b);
out = CuArrays.zeros(b);
function sum!(out,in)
  a = size(in,1)
  # one thread per column: thread i sums column i of `in`
  i = threadIdx().x + (blockIdx().x-1)*blockDim().x
  I = (i-Int32(1))*Int32(a)      # linear index just before the start of column i
  val = Float32(0)
  for j = 1:a
    I += 1
    @inbounds val += ldg(in,I)   # read-only (texture-cache) load
    # @inbounds val += in[I]     # plain global load
  end
  @inbounds out[i] = val
  return nothing
end
@btime CuArrays.@sync @cuda threads=500 blocks=Int64(b/500) sum!(out,in)
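To see why the choice of load hardly matters here, it helps to look at which cache lines one warp touches. The following is a rough CPU-side sketch (assuming 128-byte cache lines and the indexing of the kernel above, with a = 10000 rows) of the byte offsets read by threads 1–32 of the first warp on the first loop iteration:

```julia
# CPU-side sketch: which 128-byte lines does one warp touch on iteration j = 1?
# In the kernel above, thread i reads in[(i-1)*a + j], so consecutive threads
# in a warp are `a` elements (a * sizeof(Float32) bytes) apart.
a = 10000                 # rows, as in the example above
linesize = 128            # assumed bytes per cache line / memory transaction
elsize = sizeof(Float32)

# byte offsets read by threads 1..32 of the first warp at j = 1
offsets = [(i - 1) * a * elsize for i in 1:32]
lines = unique(offsets .÷ linesize)
println(length(lines))    # 32 distinct lines for a single 128-byte warp request
```

Every thread lands on its own cache line, so one warp-level request of 128 useful bytes triggers 32 separate memory transactions: the pattern is fully uncoalesced.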

Google __ldg; it’s the same construct as in CUDA C.

It’s the same as using the __restrict__ qualifier in CUDA C/C++. You can read more about this topic in this post.

It makes no difference because the global memory access pattern in your kernel is uncoalesced, and in rarely hits in the L1 cache.

The block size (threads=500) also looks weird to me. CUDA blocks are split into warps of 32 threads, so even if you set the block size to 500, the block still runs as 16 warps, with the last warp only partially active.
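The warp arithmetic for a 500-thread block can be checked directly (a quick sketch, using 32 threads per warp):

```julia
# A block of 500 threads is still scheduled in units of 32-thread warps.
threads = 500
warps = cld(threads, 32)   # warps needed to cover the block: 16
tail  = threads % 32       # active threads in the last, partially-filled warp: 20
println((warps, tail))     # (16, 20)
```

Making the block size a multiple of 32 (e.g. 512, with a matching block count) avoids the partially-filled warp.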

They are not really related: restrict teaches the compiler that pointers do not alias, which (often in combination with const) may let it infer that certain reads are from read-only memory and optimize them as such. ldg bypasses any such analysis and forces the hardware to read through the texture cache, which should only be done for read-only memory. Using the latter on memory that is written to will result in corruption, whereas you can safely tag such pointers restrict.

Do you have a source for that? IIUC these reads are never cached in L1, but in L2 or in the read-only data cache, depending on the use of ldg. I’m also not sure what memory coalescing has to do with that.

Maybe I shouldn’t have said “It’s the same”; as explained above, from a compiler’s perspective they’re quite different. What I really meant is that, from a user’s perspective, __restrict__ and ldg are just two alternative ways to enable read-only loads (ref: S0514-GTC2012-GPU-Performance-Analysis, page 32). In this sense, they’re related. :slightly_smiling_face:

It performs the same regardless of syntax ldg(in,I) or in[I]

It makes no difference because the global memory access pattern in your kernel is uncoalesced, and in rarely hits in the L1 cache.

IIUC these reads are never cached in L1, but in L2 or in the read-only data cache, depending on the use of ldg. I’m also not sure what memory coalescing has to do with that.

Using read-only loads or not should not perform exactly the same, but it looks like the performance difference is negligible, as the OP reported above. I think the reason is similar to the scenarios described in S0514-GTC2012-GPU-Performance-Analysis, pages 44–45.

In the case of a caching load (in[I]), the bus utilization is 128/(N * 128); in the case of a read-only load (ldg(in,I)), it is 128/(N * 32), where N is the number of distinct lines touched by one warp request. When N is quite large, both utilizations are so low that the difference between them can be ignored. Low bus utilization means many “wasted” transactions, which in turn hurts the effective memory throughput.
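Plugging in N = 32 (the number of distinct lines one warp touches in this kernel, per the sketch earlier in the thread) gives concrete numbers for the two formulas:

```julia
# Bus utilization per warp load request (128 useful bytes), following the
# formulas above, for the N = 32 distinct lines touched by this kernel:
N = 32
caching  = 128 / (N * 128)   # caching load: 128-byte transactions via L1
readonly = 128 / (N * 32)    # read-only load: 32-byte transactions
println((caching, readonly)) # (0.03125, 0.125) — both far below full utilization
```

The read-only path wastes 4× fewer bytes per transaction, but both utilizations are tiny, which is consistent with the negligible difference the OP measured.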

That’s why I think the main reason for the bad performance is the uncoalesced global memory access pattern, rather than whether the read-only cache is used.
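For contrast, here is the same CPU-side line-counting sketch for a coalesced pattern, where consecutive threads in a warp read consecutive elements (a hypothetical re-indexing, just to illustrate the difference, again assuming 128-byte lines):

```julia
# Coalesced pattern: thread i of a warp reads element i, so the 32 threads'
# addresses are consecutive 4-byte floats spanning exactly 128 bytes.
linesize = 128
elsize = sizeof(Float32)
offsets = [(i - 1) * elsize for i in 1:32]  # byte offsets 0, 4, ..., 124
lines = unique(offsets .÷ linesize)
println(length(lines))   # 1: a single line serves the whole warp request
```

One 128-byte transaction then carries 128 useful bytes (utilization 1.0), versus the 32 transactions of the column-wise pattern in the original kernel.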

For someone else reading this, a good reference for how ldg works:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html