CUDA unexplainable SPEEDUP! Local memory?

Hey,

I was trying to implement a kernel function when I ran into this incredibly big difference:

using BenchmarkTools
using CUDA
fast(v, nins, k, j, glay, glay2) = begin
	I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
	if I > 1000
		return
	end
	# the loop bound lives in global memory and is mutated in place below
	@inbounds while glay[I] <= glay2[I]
		p=k[1]
		while p < j[1]
			v[I] += v[nins[p]+I+1000]
			p+=1
		end
		glay[I]+=1
	end
end
slow(v, nins, k, j, layer) = begin
	I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
	if I > 1000
		return
	end
	# the loop counter is a thread-local variable
	i=1
	@inbounds while i <= layer
		p=k[1]
		while p < j[1]
			v[I] += v[nins[p]+I+1000]
			p+=1
		end
		i+=1
	end
end

glay = CuArray(fill(1,1000))
glay2 = CuArray(fill(1,1000))
ff = CuArray(fill(4,1000,1000))
tt = CuArray(fill(10,1000,1000))
a2 = CuArray([1,2,3,4,5,6,7,8,9,10,11,12])
a1 = CuArray(randn(Float32,1000,1,20))

display(@benchmark CUDA.@sync @cuda threads=512 blocks=2  slow($a1,$a2,$ff,$tt,100))
 
glay = CuArray(fill(1,1000))
glay2 = CuArray(fill(1000,1000))
ff = CuArray(fill(4,1000,1000))
tt = CuArray(fill(10,1000,1000))
a2 = CuArray([1,2,3,4,5,6,7,8,9,10,11,12])
a1 = CuArray(randn(Float32,1000,1,20))
display(@benchmark CUDA.@sync @cuda threads=512 blocks=2  fast($a1,$a2,$ff,$tt,$glay,$glay2))

Results are crazy!

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  276.518 μs …  10.874 ms  ┊ GC (min … max): 0.00% … 99.26%
 Time  (median):     279.631 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   300.160 μs ± 173.301 μs  ┊ GC (mean ± σ):  0.65% ±  1.39%

  █▃                                                            ▁
  ██▇▅▇█▄▇▆█▆▅▄▃▃▃▁▄▆▄▃█▆▃▄▃▃▃▁▁▁▃▁▁▁▁▃▃▄▄▇▇▅▄▃▁▄▃▃▃▃▄▃▄▃▁▄▄▃▃▄ █
  277 μs        Histogram: log(frequency) by time       1.02 ms <

 Memory estimate: 2.02 KiB, allocs estimate: 59.
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.940 μs … 121.277 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.491 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.986 μs ±   4.433 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁▃▇█▇▆▇▅▅▃▃▂▁▁▁    ▁  ▁ ▁                                  ▂
  ▆████████████████████████████▇▇▇▆▇▆▆▆▇▇▆▆▆▅▆▅▆▅▆▅▅▅▄▅▄▅▄▄▆▄ █
  4.94 μs      Histogram: log(frequency) by time      10.5 μs <

Basically a 5000%+ speedup?
What am I doing wrong?
This sounds very serious. In the fast version I am basically using 1000 different values (one per thread) for the loop indexing, and it is many times faster. But the local counter “i” should also run through the same range of values… What is going on here?

Could the problem be that I don’t use local memory?
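
For what it’s worth, if the goal were to keep the per-thread counter in a register instead of global memory, the kernel could cache glay[I] in a local variable and write it back once at the end. A hypothetical sketch (fast2, g, and gmax are names I made up, not from the original code):

fast2(v, nins, k, j, glay, glay2) = begin
	I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
	if I > 1000
		return
	end
	g = glay[I]          # cache the counter in a register
	gmax = glay2[I]
	@inbounds while g <= gmax
		p = k[1]
		while p < j[1]
			v[I] += v[nins[p]+I+1000]
			p += 1
		end
		g += 1
	end
	glay[I] = g          # persist the final counter with a single store
	return
end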

The diff between the two versions is basically:

From this:

	i=1
	@inbounds while i <= layer
                ..... some special calculation with while loop
		i+=1
	end

To this:

	@inbounds while glay[I] <= glay2[I]
                ..... some special calculation with while loop
		glay[I]+=1
	end

Sounds like the local “i” iterator in the while/for loop is problematic when a nested loop comes into play.

Oh… can it be that it only ran once? Every other time it just measured 0 iterations. :frowning:
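
One way to check that hypothesis (a sketch, reusing a1, a2, ff, tt from the snippet above): launch the kernel once and inspect glay on the host afterwards.

glay  = CuArray(fill(1, 1000))
glay2 = CuArray(fill(1000, 1000))
CUDA.@sync @cuda threads=512 blocks=2 fast(a1, a2, ff, tt, glay, glay2)
# After one launch every glay[I] has been incremented past glay2[I],
# so any further launch performs zero outer iterations.
@show Array(glay)[1]   # expect 1001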

Easiest way to find out would be to run both versions of the code on the same set of inputs and compare the results, no?
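
For example, something like this (a sketch; v1 and v2 are hypothetical fresh copies of the input so each kernel mutates its own buffer, and glay2 is set to 100 to match layer = 100 so the outputs are comparable):

v1 = CuArray(randn(Float32, 1000, 1, 20)); v2 = copy(v1)
glay  = CuArray(fill(1, 1000))
glay2 = CuArray(fill(100, 1000))   # match layer = 100 in the slow call
CUDA.@sync @cuda threads=512 blocks=2 slow(v1, a2, ff, tt, 100)
CUDA.@sync @cuda threads=512 blocks=2 fast(v2, a2, ff, tt, glay, glay2)
@show Array(v1) ≈ Array(v2)        # true if the two versions agree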

The results are correct both times.

The problem is that @benchmark computed the real result once, and the other 9999 samples did basically 0 iterations, because glay[I] had already passed glay2[I] by then, so the outer loop exited immediately each time.
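
A minimal sketch of a fairer benchmark, assuming the intent is to redo the full loop on every timed run: reset glay in a setup expression and force one evaluation per sample (setup and evals are standard BenchmarkTools keywords; variables defined in setup are referenced without $):

# Recreate the mutated state before every evaluation, so each timed
# launch performs the full outer loop instead of exiting immediately.
display(@benchmark CUDA.@sync(@cuda threads=512 blocks=2 fast($a1, $a2, $ff, $tt, glay, $glay2)) setup=(glay = CuArray(fill(1, 1000))) evals=1)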

“When you realise you hacked your video card and reached infinite speed, and then realise you just modified your code as many times as you made a mistake and basically did nothing.” XD