Improving cache behavior with ldg()

What exactly is meant by the statement at https://juliagpu.github.io/CUDAnative.jl/stable/lib/device/array.html

that ldg “loads the value through the read-only texture cache for improved cache behavior”?

The following is a memory-bandwidth-bound kernel that spends most of its execution time waiting on reads from global memory. It performs the same regardless of whether the load is written as ldg(in,I) or in[I]:

using CuArrays, CUDAnative, BenchmarkTools
a = 10000;
b = 5000;
in = CuArrays.rand(a,b);
out = CuArrays.zeros(b);
function sum!(out,in)
  a = size(in,1)
  # one thread per column: thread i sums column i of `in`
  i = threadIdx().x + (blockIdx().x-1)*blockDim().x
  I = (i-Int32(1))*Int32(a)      # linear index just before the start of column i
  val = Float32(0)
  for j = 1:a
    I += 1
    @inbounds val += ldg(in,I)   # read-only (texture-cache) load
    # @inbounds val += in[I]     # plain global load
  end
  @inbounds out[i] = val
  return nothing
end
@btime CuArrays.@sync @cuda threads=500 blocks=Int64(b/500) sum!(out,in)
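To see why the choice of load hardly matters here, it helps to look at which cache lines one warp touches. The following is a rough CPU-side sketch (assuming 128-byte cache lines and the indexing of the kernel above, with a = 10000 rows) of the byte offsets read by threads 1–32 of the first warp on the first loop iteration:

```julia
# CPU-side sketch: which 128-byte lines does one warp touch on iteration j = 1?
# In the kernel above, thread i reads in[(i-1)*a + j], so consecutive threads
# in a warp are `a` elements (a * sizeof(Float32) bytes) apart.
a = 10000                 # rows, as in the example above
linesize = 128            # assumed bytes per cache line / memory transaction
elsize = sizeof(Float32)

# byte offsets read by threads 1..32 of the first warp at j = 1
offsets = [(i - 1) * a * elsize for i in 1:32]
lines = unique(offsets .÷ linesize)
println(length(lines))    # 32 distinct lines for a single 128-byte warp request
```

Every thread lands on its own cache line, so one warp-level request of 128 useful bytes triggers 32 separate memory transactions: the pattern is fully uncoalesced.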

Google __ldg; it’s the same construct as in CUDA C.

It’s the same as using the __restrict__ qualifier in CUDA C/C++. You can read more about this topic in this post.

It makes no difference because the global memory access pattern in your kernel is uncoalesced, and in rarely hits in the L1 cache.

The block size (threads=500) also looks weird to me. CUDA blocks are split into warps of 32 threads, so even if you set the block size to 500, the block still runs as 16 warps, with the last warp only partially active.
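The warp arithmetic for a 500-thread block can be checked directly (a quick sketch, using 32 threads per warp):

```julia
# A block of 500 threads is still scheduled in units of 32-thread warps.
threads = 500
warps = cld(threads, 32)   # warps needed to cover the block: 16
tail  = threads % 32       # active threads in the last, partially-filled warp: 20
println((warps, tail))     # (16, 20)
```

Making the block size a multiple of 32 (e.g. 512, with a matching block count) avoids the partially-filled warp.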

They are not really related: restrict teaches the compiler that pointers do not alias, which (often in combination with const) may let it infer that certain reads are from read-only memory and optimize them as such. ldg bypasses any such analysis and forces the hardware to read through the texture cache, which should only be done for read-only memory. Using the latter on memory that is written to will result in corruption, whereas you can safely tag such pointers restrict.

Do you have a source for that? IIUC these reads are never cached in L1, but in L2 or in the read-only data cache, depending on the use of ldg. I’m also not sure what memory coalescing has to do with that.

Maybe I shouldn’t have said “It’s the same”; as explained above, from a compiler’s perspective they’re quite different. What I really meant is that, from a user’s perspective, __restrict__ and ldg are just two alternative ways to enable read-only loads (ref: S0514-GTC2012-GPU-Performance-Analysis, page 32). In this sense, they’re related. :slightly_smiling_face:

It performs the same regardless of syntax ldg(in,I) or in[I]

It makes no difference because the global memory access pattern in your kernel is uncoalesced, and in rarely hits in the L1 cache.

IIUC these reads are never cached in L1, but in L2 or in the read-only data cache, depending on the use of ldg. I’m also not sure what memory coalescing has to do with that.

Using read-only loads or not should not perform exactly the same, but it looks like the performance difference is negligible, as the OP reported above. I think the reason is similar to the scenarios described in S0514-GTC2012-GPU-Performance-Analysis, pages 44–45.

In the case of a caching load (in[I]), the bus utilization is 128/(N * 128); in the case of a read-only load (ldg(in,I)), it is 128/(N * 32), where N is the number of distinct lines touched by one warp request. When N is quite large, both utilizations are so low that the difference between them can be ignored. Low bus utilization means many “wasted” transactions, which in turn hurts the effective memory throughput.
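Plugging in N = 32 (the number of distinct lines one warp touches in this kernel, per the sketch earlier in the thread) gives concrete numbers for the two formulas:

```julia
# Bus utilization per warp load request (128 useful bytes), following the
# formulas above, for the N = 32 distinct lines touched by this kernel:
N = 32
caching  = 128 / (N * 128)   # caching load: 128-byte transactions via L1
readonly = 128 / (N * 32)    # read-only load: 32-byte transactions
println((caching, readonly)) # (0.03125, 0.125) — both far below full utilization
```

The read-only path wastes 4× fewer bytes per transaction, but both utilizations are tiny, which is consistent with the negligible difference the OP measured.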

That’s why I think the main reason for the bad performance is the uncoalesced global memory access pattern, rather than whether the read-only cache is used.
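For contrast, here is the same CPU-side line-counting sketch for a coalesced pattern, where consecutive threads in a warp read consecutive elements (a hypothetical re-indexing, just to illustrate the difference, again assuming 128-byte lines):

```julia
# Coalesced pattern: thread i of a warp reads element i, so the 32 threads'
# addresses are consecutive 4-byte floats spanning exactly 128 bytes.
linesize = 128
elsize = sizeof(Float32)
offsets = [(i - 1) * elsize for i in 1:32]  # byte offsets 0, 4, ..., 124
lines = unique(offsets .÷ linesize)
println(length(lines))   # 1: a single line serves the whole warp request
```

One 128-byte transaction then carries 128 useful bytes (utilization 1.0), versus the 32 transactions of the column-wise pattern in the original kernel.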

For someone else reading this, a good reference for how ldg works:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html