Help Debugging GPU Performance Issue

The way you are measuring time is incorrect: you are currently measuring the CPU time needed to launch the kernel rather than the time the kernel itself takes to run. To avoid this, put CUDA.@sync in front of the @time/@btime expression, or add CUDA.synchronize() at the end of the block being timed. Doing so on my machine, I got:
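As a minimal sketch of the two options (the array a here is just a stand-in for your data):

using CUDA, BenchmarkTools

a = CUDA.rand(64)

# Only times the asynchronous kernel launch on the CPU:
@btime findmax($a)

# Waits for the GPU to finish before the timer stops:
@btime CUDA.@sync findmax($a)

# Equivalent: synchronize explicitly at the end of the timed block.
@btime begin
    findmax($a)
    CUDA.synchronize()
end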

julia> @benchmark CUDA.@sync begin
           a = CUDA.rand(64)
           findmax(a)
       end
BenchmarkTools.Trial: 
  memory estimate:  8.77 KiB
  allocs estimate:  283
  --------------
  minimum time:     340.618 μs (0.00% GC)
  median time:      28.718 ms (0.00% GC)
  mean time:        29.137 ms (1.16% GC)
  maximum time:     64.510 ms (0.00% GC)
  --------------
  samples:          172
  evals/sample:     1

julia> @benchmark CUDA.@sync begin
           findmax(CUDA.rand(64))
       end
BenchmarkTools.Trial: 
  memory estimate:  2.93 MiB
  allocs estimate:  39756
  --------------
  minimum time:     29.005 ms (0.00% GC)
  median time:      29.461 ms (0.00% GC)
  mean time:        29.906 ms (1.33% GC)
  maximum time:     36.513 ms (17.74% GC)
  --------------
  samples:          168
  evals/sample:     1

Now the only thing that stands out to me is the number of allocations, so I tried freeing the array eagerly with CUDA.unsafe_free! to see whether that changes anything:

julia> @benchmark CUDA.@sync begin
           a = CUDA.rand(64)
           findmax(a)
           CUDA.unsafe_free!(a)
       end
BenchmarkTools.Trial: 
  memory estimate:  2.93 MiB
  allocs estimate:  39763
  --------------
  minimum time:     29.770 ms (0.00% GC)
  median time:      30.066 ms (0.00% GC)
  mean time:        30.578 ms (1.34% GC)
  maximum time:     37.684 ms (18.19% GC)
  --------------
  samples:          164
  evals/sample:     1