CUDA unexplainable SPEEDUP! Local memory?

Marcell_Havlik · December 29, 2021, 1:36pm

Hey,

I was trying to implement a kernel function when I ran into this incredibly big difference:

using BenchmarkTools
using CUDA
fast(v, nins,k,j,glay,glay2) = begin
	I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
	if I > 1000
		return
	end
	@inbounds while glay[I] <= glay2[I]
		p=k[1]
		while p < j[1]
			v[I] += v[nins[p]+I+1000]
			p+=1
		end
		glay[I]+=1
	end
end
slow(v, nins,k,j,layer) = begin
	I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
	if I > 1000
		return
	end
	i=1
	@inbounds while i <= layer
		p=k[1]
		while p < j[1]
			v[I] += v[nins[p]+I+1000]
			p+=1
		end
		i+=1
	end
end

glay = CuArray(fill(1,1000))
glay2 = CuArray(fill(1,1000))
ff = CuArray(fill(4,1000,1000))
tt = CuArray(fill(10,1000,1000))
a2 = CuArray([1,2,3,4,5,6,7,8,9,10,11,12])
a1 = CuArray(randn(Float32,1000,1,20))

display(@benchmark CUDA.@sync @cuda threads=512 blocks=2  slow($a1,$a2,$ff,$tt,100))
 
glay = CuArray(fill(1,1000))
glay2 = CuArray(fill(1000,1000))
ff = CuArray(fill(4,1000,1000))
tt = CuArray(fill(10,1000,1000))
a2 = CuArray([1,2,3,4,5,6,7,8,9,10,11,12])
a1 = CuArray(randn(Float32,1000,1,20))
display(@benchmark CUDA.@sync @cuda threads=512 blocks=2  fast($a1,$a2,$ff,$tt,$glay,$glay2))

Results are crazy!

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  276.518 μs …  10.874 ms  ┊ GC (min … max): 0.00% … 99.26%
 Time  (median):     279.631 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   300.160 μs ± 173.301 μs  ┊ GC (mean ± σ):  0.65% ±  1.39%

  █▃                                                            ▁
  ██▇▅▇█▄▇▆█▆▅▄▃▃▃▁▄▆▄▃█▆▃▄▃▃▃▁▁▁▃▁▁▁▁▃▃▄▄▇▇▅▄▃▁▄▃▃▃▃▄▃▄▃▁▄▄▃▃▄ █
  277 μs        Histogram: log(frequency) by time       1.02 ms <

 Memory estimate: 2.02 KiB, allocs estimate: 59.
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.940 μs … 121.277 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.491 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.986 μs ±   4.433 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁▃▇█▇▆▇▅▅▃▃▂▁▁▁    ▁  ▁ ▁                                  ▂
  ▆████████████████████████████▇▇▇▆▇▆▆▆▇▇▆▆▆▅▆▅▆▅▆▅▅▅▄▅▄▅▄▄▆▄ █
  4.94 μs      Histogram: log(frequency) by time      10.5 μs <

Basically a 5000%+ speedup?
What am I doing wrong?
This seems very serious. I am basically using 1000 different values for the loop index (one per thread, read from an array) and it is many times faster. But the local loop variable “i” should also be a separate value in each of the 1000 threads… What is going on here?

Could the problem be that I don’t use local memory?

Marcell_Havlik · December 29, 2021, 2:02pm

The difference between the two versions is basically:

From this:

	i=1
	@inbounds while i <= layer
                ..... some special calculation with while loop
		i+=1
	end

To this:

	@inbounds while glay[I] <= glay2[I]
                ..... some special calculation with while loop
		glay[I]+=1
	end

It sounds like the local “i” iterator in the while/for loop is problematic when a nested loop comes into play.

Marcell_Havlik · December 29, 2021, 2:51pm

Oh… could it be that it only ran once? Every other time it just measured 0 iterations.

Per · December 29, 2021, 3:22pm

Easiest way to find out would be to run both versions of the code on the same set of inputs and compare the results, no?
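A minimal sketch of that comparison, assuming CPU stand-ins for the two kernels so it runs without a GPU. The names `slow_cpu!` and `fast_cpu!` are hypothetical, `idx` plays the role of the thread index `I`, and both versions are given the same number of outer iterations (100):

```julia
# CPU stand-ins for the two kernels (hypothetical names, not from the thread).
# slow_cpu! uses a local counter; fast_cpu! keeps its counter in `glay`.
function slow_cpu!(v, nins, k, j, layer)
    for idx in 1:1000
        i = 1
        while i <= layer
            p = k[1]
            while p < j[1]
                v[idx] += v[nins[p] + idx + 1000]
                p += 1
            end
            i += 1
        end
    end
    return v
end

function fast_cpu!(v, nins, k, j, glay, glay2)
    for idx in 1:1000
        while glay[idx] <= glay2[idx]
            p = k[1]
            while p < j[1]
                v[idx] += v[nins[p] + idx + 1000]
                p += 1
            end
            glay[idx] += 1   # mutates the input array
        end
    end
    return v
end

nins = collect(1:12)
k = fill(4, 1000, 1000)
j = fill(10, 1000, 1000)
v0 = randn(Float32, 1000, 1, 20)

# Fresh copies of the same data for each version.
v_slow = slow_cpu!(copy(v0), nins, k, j, 100)
v_fast = fast_cpu!(copy(v0), nins, k, j, fill(1, 1000), fill(100, 1000))

println(v_slow == v_fast)
```

Note that this check only passes because each version gets fresh inputs. Calling `fast_cpu!` a second time with the same `glay` array would leave `v` unchanged.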

Marcell_Havlik · December 29, 2021, 3:35pm

The results are fine both times.

The problem is that @benchmark computed the result only on the first call. Every later call did basically 0 iterations, because glay[I] had already passed glay2[I] and the loop condition was false from the start.
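The effect can be reproduced without a GPU. The following is a rough sketch, assuming a hypothetical CPU helper `count_outer!` that mimics only the outer loop of `fast` and reports how many iterations actually ran:

```julia
# CPU stand-in for the outer loop of `fast` (hypothetical helper).
# Returns the number of outer iterations that ran; mutates `glay`.
function count_outer!(glay, glay2)
    iterations = 0
    for idx in eachindex(glay)
        while glay[idx] <= glay2[idx]
            glay[idx] += 1
            iterations += 1
        end
    end
    return iterations
end

glay  = fill(1, 1000)
glay2 = fill(1000, 1000)

first_call  = count_outer!(glay, glay2)   # does the real work
second_call = count_outer!(glay, glay2)   # glay is already past glay2

glay .= 1                                 # reset the counters
after_reset = count_outer!(glay, glay2)   # work is done again

println((first_call, second_call, after_reset))  # (1000000, 0, 1000000)
```

For benchmarking a function that mutates its inputs like this, BenchmarkTools accepts a `setup=` expression together with `evals=1`, so that the inputs are re-created before every measured call.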

“When you think you hacked your video card and reached infinite speed, and then realise you just modded your code until you made a mistake and it basically did nothing.” XD
