What exactly does the documentation at https://juliagpu.github.io/CUDAnative.jl/stable/lib/device/array.html
mean when it says that ldg “loads the value through the read-only texture cache for improved cache behavior”?
The following is a memory-bandwidth-bound kernel that spends most of its execution time waiting on reads from global memory. It performs the same whether the load is written as ldg(in,I) or as plain indexing in[I]:
    using CuArrays, CUDAnative, BenchmarkTools

    a = 10000; b = 5000
    in = CuArrays.rand(a,b)
    out = CuArrays.zeros(b)

    function sum!(out,in)
        a = size(in,1)
        i = threadIdx().x + (blockIdx().x-1)*blockDim().x
        I = (i-Int32(1))*Int32(a)
        val = Float32(0)
        for j = 1:a
            I += 1
            @inbounds val += ldg(in,I)
            # @inbounds val += in[I]
        end
        @inbounds out[i] = val
        return nothing
    end

    @btime CuArrays.@sync @cuda threads=500 blocks=Int64(b/500) sum!(out,in)
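As a rough sanity check on the bandwidth-bound claim, the measured time can be converted into an effective bandwidth and compared with the card's peak. The following is only a minimal sketch: `effective_bandwidth` is a hypothetical helper (not part of CuArrays or CUDAnative), and `t` is a placeholder to be replaced with the minimum time reported by `@btime`, not a measured result.

```julia
# The kernel reads a*b Float32 values from `in` and writes b values to `out`.
a, b = 10_000, 5_000
bytes_moved = (a * b + b) * sizeof(Float32)

# Hypothetical helper: convert bytes and seconds into GB/s.
effective_bandwidth(bytes, seconds) = bytes / seconds / 1e9

# Placeholder, NOT a measurement: substitute the minimum time from @btime (in seconds).
t = 1.0e-3
println(effective_bandwidth(bytes_moved, t), " GB/s")
```

If the resulting figure is already close to the GPU's peak memory bandwidth, there would be little room left for a different load path to help, which might be one reason the two variants time the same.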