Help Debugging GPU Performance Issue

The way you are measuring time is incorrect: you are currently measuring the CPU time needed to launch the kernel rather than the time the kernel itself takes to run. To avoid this, put CUDA.@sync in front of the @time/@btime expression, or add CUDA.synchronize() at the end of the block being timed. Doing so on my machine, I got:
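As a minimal sketch of the two options (the array a here is just a stand-in for your data):

using CUDA, BenchmarkTools

a = CUDA.rand(64)

# Only times the asynchronous kernel launch on the CPU:
@btime findmax($a)

# Waits for the GPU to finish before the timer stops:
@btime CUDA.@sync findmax($a)

# Equivalent: synchronize explicitly at the end of the timed block.
@btime begin
    findmax($a)
    CUDA.synchronize()
end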

julia> @benchmark CUDA.@sync begin
           a = CUDA.rand(64)
           findmax(a)
       end
BenchmarkTools.Trial: 
  memory estimate:  8.77 KiB
  allocs estimate:  283
  --------------
  minimum time:     340.618 μs (0.00% GC)
  median time:      28.718 ms (0.00% GC)
  mean time:        29.137 ms (1.16% GC)
  maximum time:     64.510 ms (0.00% GC)
  --------------
  samples:          172
  evals/sample:     1

julia> @benchmark CUDA.@sync begin
           findmax(CUDA.rand(64))
       end
BenchmarkTools.Trial: 
  memory estimate:  2.93 MiB
  allocs estimate:  39756
  --------------
  minimum time:     29.005 ms (0.00% GC)
  median time:      29.461 ms (0.00% GC)
  mean time:        29.906 ms (1.33% GC)
  maximum time:     36.513 ms (17.74% GC)
  --------------
  samples:          168
  evals/sample:     1

Now the only thing that stands out to me is the number of allocations, so I tried freeing the array eagerly with CUDA.unsafe_free! to see whether that changes anything:

julia> @benchmark CUDA.@sync begin
           a = CUDA.rand(64)
           findmax(a)
           CUDA.unsafe_free!(a)
       end
BenchmarkTools.Trial: 
  memory estimate:  2.93 MiB
  allocs estimate:  39763
  --------------
  minimum time:     29.770 ms (0.00% GC)
  median time:      30.066 ms (0.00% GC)
  mean time:        30.578 ms (1.34% GC)
  maximum time:     37.684 ms (18.19% GC)
  --------------
  samples:          164
  evals/sample:     1