The way you are measuring time is incorrect: you are currently measuring the CPU time it takes to launch the kernel, not the kernel execution itself. To avoid this, add `CUDA.@sync` in front of the `@time`/`@btime`, or add `CUDA.synchronize()` at the end of your block. Doing so on my machine, I got:
```julia
julia> @benchmark CUDA.@sync begin
           a = CUDA.rand(64)
           findmax(a)
       end
BenchmarkTools.Trial:
  memory estimate:  8.77 KiB
  allocs estimate:  283
  --------------
  minimum time:     340.618 μs (0.00% GC)
  median time:      28.718 ms (0.00% GC)
  mean time:        29.137 ms (1.16% GC)
  maximum time:     64.510 ms (0.00% GC)
  --------------
  samples:          172
  evals/sample:     1
```
```julia
julia> @benchmark CUDA.@sync begin
           findmax(CUDA.rand(64))
       end
BenchmarkTools.Trial:
  memory estimate:  2.93 MiB
  allocs estimate:  39756
  --------------
  minimum time:     29.005 ms (0.00% GC)
  median time:      29.461 ms (0.00% GC)
  mean time:        29.906 ms (1.33% GC)
  maximum time:     36.513 ms (17.74% GC)
  --------------
  samples:          168
  evals/sample:     1
```
Now the only thing that stands out to me is the difference in allocations, which can be explained by `unsafe_free!` (adding it back to the first version reproduces the higher count):
```julia
julia> @benchmark CUDA.@sync begin
           a = CUDA.rand(64)
           findmax(a)
           CUDA.unsafe_free!(a)
       end
BenchmarkTools.Trial:
  memory estimate:  2.93 MiB
  allocs estimate:  39763
  --------------
  minimum time:     29.770 ms (0.00% GC)
  median time:      30.066 ms (0.00% GC)
  mean time:        30.578 ms (1.34% GC)
  maximum time:     37.684 ms (18.19% GC)
  --------------
  samples:          164
  evals/sample:     1
```
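To recap the two synchronization styles mentioned at the top, here is a minimal sketch (assuming a working CUDA.jl installation with a GPU available; the array `a` is just an example input):

```julia
using CUDA, BenchmarkTools

a = CUDA.rand(64)

# Option 1: wrap the timed expression in CUDA.@sync so the host
# waits for the GPU work to finish before the timer stops.
@btime CUDA.@sync findmax($a)

# Option 2: call CUDA.synchronize() explicitly at the end of the block.
@btime begin
    findmax($a)
    CUDA.synchronize()
end
```

Either form ensures the measurement covers the kernel execution itself rather than only the asynchronous launch.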