Hey,
I was trying to implement a kernel function when I ran into this incredibly large difference:
using BenchmarkTools
using CUDA
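fast(v, nins, k, j, glay, glay2) = begin
    # Global 1-based thread index across all blocks
    I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if I > 1000
        return
    end
    # The outer loop counter is stored in glay and incremented in place
    @inbounds while glay[I] <= glay2[I]
        p = k[1]
        while p < j[1]
            v[I] += v[nins[p]+I+1000]
            p += 1
        end
        glay[I] += 1
    end
end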
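slow(v, nins, k, j, layer) = begin
    # Global 1-based thread index across all blocks
    I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if I > 1000
        return
    end
    # The outer loop counter is a local variable, starting at 1 on every launch
    i = 1
    @inbounds while i <= layer
        p = k[1]
        while p < j[1]
            v[I] += v[nins[p]+I+1000]
            p += 1
        end
        i += 1
    end
end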
glay = CuArray(fill(1,1000))
glay2 = CuArray(fill(1,1000))
ff = CuArray(fill(4,1000,1000))
tt = CuArray(fill(10,1000,1000))
a2 = CuArray([1,2,3,4,5,6,7,8,9,10,11,12])
a1 = CuArray(randn(Float32,1000,1,20))
display(@benchmark CUDA.@sync @cuda threads=512 blocks=2 slow($a1,$a2,$ff,$tt,100))
glay = CuArray(fill(1,1000))
glay2 = CuArray(fill(1000,1000))
ff = CuArray(fill(4,1000,1000))
tt = CuArray(fill(10,1000,1000))
a2 = CuArray([1,2,3,4,5,6,7,8,9,10,11,12])
a1 = CuArray(randn(Float32,1000,1,20))
display(@benchmark CUDA.@sync @cuda threads=512 blocks=2 fast($a1,$a2,$ff,$tt,$glay,$glay2))
The results are crazy! The first output is for slow, the second is for fast:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 276.518 μs … 10.874 ms ┊ GC (min … max): 0.00% … 99.26%
Time (median): 279.631 μs ┊ GC (median): 0.00%
Time (mean ± σ): 300.160 μs ± 173.301 μs ┊ GC (mean ± σ): 0.65% ± 1.39%
█▃ ▁
██▇▅▇█▄▇▆█▆▅▄▃▃▃▁▄▆▄▃█▆▃▄▃▃▃▁▁▁▃▁▁▁▁▃▃▄▄▇▇▅▄▃▁▄▃▃▃▃▄▃▄▃▁▄▄▃▃▄ █
277 μs Histogram: log(frequency) by time 1.02 ms <
Memory estimate: 2.02 KiB, allocs estimate: 59.
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
Range (min … max): 4.940 μs … 121.277 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.491 μs ┊ GC (median): 0.00%
Time (mean ± σ): 5.986 μs ± 4.433 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▃▇█▇▆▇▅▅▃▃▂▁▁▁ ▁ ▁ ▁ ▂
▆████████████████████████████▇▇▇▆▇▆▆▆▇▇▆▆▆▅▆▅▆▅▆▅▅▅▄▅▄▅▄▄▆▄ █
4.94 μs Histogram: log(frequency) by time 10.5 μs <
Basically a 5000%+ speedup (roughly 50x by median time)?
What am I doing wrong?
This seems very serious. I am basically using 1000 different values for the loop index and it is many times faster. But the loop over “i” should also go through 1000 different values… What is going on here?
Could the problem be that I don’t use local memory?
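One thing that might be worth checking is how many loop iterations each version really performs when it is called more than once on the same arrays. Below is a rough CPU-only sketch of the two loop structures for a single index I. The helper names count_inplace! and count_local are made up for this illustration, k and j are passed as plain integers (4 and 10, matching ff[1] and tt[1]), and nothing here models the actual GPU launch.

```julia
# CPU-only sketch (hypothetical helpers, not part of the original kernels):
# same loop structure as `fast` and `slow`, for one index I,
# counting how many inner-loop iterations run per call.

# Outer counter kept in an array and advanced in place, as in `fast`.
function count_inplace!(glay, glay2, I, k, j)
    n = 0
    while glay[I] <= glay2[I]
        p = k
        while p < j
            n += 1
            p += 1
        end
        glay[I] += 1   # this change is still there after the call returns
    end
    return n
end

# Outer counter kept in a local variable, as in `slow`.
function count_local(layer, k, j)
    n = 0
    i = 1
    while i <= layer
        p = k
        while p < j
            n += 1
            p += 1
        end
        i += 1
    end
    return n
end

glay  = fill(1, 1000)
glay2 = fill(1000, 1000)

first_call  = count_inplace!(glay, glay2, 1, 4, 10)
second_call = count_inplace!(glay, glay2, 1, 4, 10)  # same arrays, not reset
local_call  = count_local(100, 4, 10)

println((first_call, second_call, local_call))  # (6000, 0, 600)
```

If the same thing happens across @benchmark samples, which all receive the same interpolated $glay, then most samples of fast may be skipping the loop entirely. Recreating glay before each sample (BenchmarkTools has setup= and evals= keywords for this) might give a fairer comparison, though I have not verified that this accounts for the whole gap.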