Faster read only memory

I’ve got something that works, but it needs to work faster. I’ve tested different methods outside of Julia that can do this much faster, and by comparison with peakflops.jl and a rudimentary calculation of how many operations are needed, this should be able to execute much faster

That something is a nearest neighbor interpolation summing method, here’s some pseudocode:

using CuArrays, CUDAnative, CUDAdrv

function interpSum!(out,in,<variables used to index>)
i = threadIdx().x
j = blockIdx().x
k = blockIdx().y
for l 
  <compute x1 and x2>
  for m = 
    @inbounds out[i,j,k,m] += in[x1,x2,l,m]
  end
end
return nothing
end


@time CuArrays.@sync @cuda threads=512 blocks=(200,200) interpSum!(out,in,<variables used to index>)

the part I want to stress is: @inbounds out[i,j,k,m] += in[x1,x2,l,m]

if I comment that out, the code takes ~50 ms to run, with it being used, code takes 1000 ms to run.

if I replace it with @inbounds out[i,j,k,m] += 1, it takes 300 ms to run (barely acceptable)

Which tells me it’s a memory access problem, given that to the compiler, the indices x1 and x2 are random (and there aren’t any ways for me to make them ordered or predictable)

I’ve tried @inbounds map[i,j,k,m] += ldg(data,index) where I compute the 1 dimensional index, and it is exactly as slow (I’ve read in other forums that ldg is read only and should be faster, but here it is not)

I’ve also tried interpolation methods in arrayfire, I had trouble with the memory management, if I find that code again, I’ll come back here and post it.

Fundamentally, I’m interpolating and summing, so if there already exists a fast gpu interpolation method, that would solve my problem, and I’ve seen a lot of convincing benchmarks that indicate texture memory would work really well (and I’m pretty sure that’s what arrayfire uses, but it reallocates the memory every time you call interpolation method, so after a few calls I get a memory allocation error and I have to restart Julia)

The compiler doesn’t matter here, it’s your code that’s slow :slight_smile: We don’t do the kind of optimizations you seem to expect (changing or optimizing memory access patterns).

To optimize this, you’ll need to look into the GPU’s architectural details and avoid costly operations. For example, you are hitting global memory here all the time, both loading from and writing to it through in and out. You could restructure your kernel to buffer this data in local memory. You are also doing so in a random pattern, defeating memory memory coalescing. Again, if you buffer locally and read/write to global memory in consecutive chunks, different memory transactions might be able to get merged. See for example CuArray is Row Major or Column Major?

I suggest you look into optimizing bandwidth-bound CUDA kernels (which yours seem to be) and get familiar with the necessary tools (notably nvprof/nsight-compute).

I’m not sure what you mean by this. Does an equivalent CUDA C kernel perform better? That would be the only fair comparison.

And FWIW, please try to post runnable snippets. It’s much harder to help if just brainstorming based on pseudocode.

3 Likes

I’m saying x1 and x2 are random, so I can’t use memory coalescing, which is the type of optimization I’m expecting.

runable script:

using CuArrays, CUDAnative

A = 512;
B = 512;
C1 = 10;
C2 = 30;
D = 10;

in = CuArrays.rand(A,B,C1,D);
out = CuArrays.zeros(A,B,C2,D);

aVec = CuArrays.rand(A) .*A;
bVec = CuArrays.rand(B) .*B;
c1Vec = CuArrays.rand(C1) .*C1;
c2Vec = CuArrays.rand(C2) .*C2;

function interpSum!(out,in,aVec,bVec,c1Vec,c2Vec)
i = threadIdx().x; j = blockIdx().x; k = blockIdx().y;
@inbounds a = aVec[i]; @inbounds b = bVec[j]; @inbounds c2 = c2Vec[k]
A = size(out,1); B = size(out,2); C1 = size(in,3); D = size(out,4)
for l = 1:C1
  @inbounds c1 = c1Vec[l]
  x1 = mod( Int32( round(a - b*c1 + c2*c1^2/2) ) + 1,A) + 1
  x2 = mod( Int32( round(b + a*c1 - c2*c1^2/2) ) + 1,B) + 1
  for m = 1:D
    @inbounds out[i,j,k,m] += in[x1,x2,l,m]
  end
end
return nothing
end
@time CuArrays.@sync @cuda threads=A blocks=(B,C2) interpSum!(out,in,aVec,bVec,c1Vec,c2Vec)

A CUDA C kernel using texture memory performs significantly better, I’m mainly looking for the Julia equivalent

This issue might help show the current state of texture memory. https://github.com/JuliaGPU/CUDAnative.jl/issues/426

1 Like

That’s not “equivalent”, but using dedicated hardware which explains any speedup. The WIP at https://github.com/cdsousa/CuTextures.jl is a good implementation of that functionality, and will probably end up being part of CUDAnative as soon as somebody puts in the work :slightly_smiling_face:

1 Like