Help Debugging GPU Performance Issue

I’m having trouble debugging the following issue with CuArrays. I see very different performance from the findmax function when I call it from the REPL vs. when I use it in my application. From the REPL I get the desired “good” performance, which I can reproduce in Example 1 below. In my application I call the function exactly as in Example 1, yet I always see the “bad” performance of Example 2.

Rather than posting my more complicated code, I’m wondering what is the best way to figure out what is going on under the hood in each case and how to fix it.

Example 1:

using CuArrays, BenchmarkTools

a = CuArrays.rand(64)
@btime findmax(a)

196.191 μs (265 allocations: 8.42 KiB)

Example 2:

@btime findmax(CuArrays.rand(64))

108.159 ms (41192 allocations: 2.99 MiB)

FYI, there is almost no overhead to creating the array, so that can’t account for the difference.

@btime CuArrays.rand(64)

9.079 μs (6 allocations: 144 bytes)

The way you are measuring time is incorrect: currently you are measuring the CPU time to launch the kernel rather than the kernel execution itself. To avoid this, add CUDA.@sync in front of the @time/@btime call, or add CUDA.synchronize() at the end of your block (see the sketch at the end of this post). Doing so on my machine, I got:

julia> @benchmark CUDA.@sync begin
           a = CUDA.rand(64)
           findmax(a)
       end
BenchmarkTools.Trial: 
  memory estimate:  8.77 KiB
  allocs estimate:  283
  --------------
  minimum time:     340.618 μs (0.00% GC)
  median time:      28.718 ms (0.00% GC)
  mean time:        29.137 ms (1.16% GC)
  maximum time:     64.510 ms (0.00% GC)
  --------------
  samples:          172
  evals/sample:     1

julia> @benchmark CUDA.@sync begin
           findmax(CUDA.rand(64))
       end
BenchmarkTools.Trial: 
  memory estimate:  2.93 MiB
  allocs estimate:  39756
  --------------
  minimum time:     29.005 ms (0.00% GC)
  median time:      29.461 ms (0.00% GC)
  mean time:        29.906 ms (1.33% GC)
  maximum time:     36.513 ms (17.74% GC)
  --------------
  samples:          168
  evals/sample:     1

Now the only thing that pops out to me is the difference in allocations, which can be explained by the unsafe_free!:

julia> @benchmark CUDA.@sync begin
           a = CUDA.rand(64)
           findmax(a)
           CUDA.unsafe_free!(a)
       end
BenchmarkTools.Trial: 
  memory estimate:  2.93 MiB
  allocs estimate:  39763
  --------------
  minimum time:     29.770 ms (0.00% GC)
  median time:      30.066 ms (0.00% GC)
  mean time:        30.578 ms (1.34% GC)
  maximum time:     37.684 ms (18.19% GC)
  --------------
  samples:          164
  evals/sample:     1
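
For completeness, here is a minimal sketch of the CUDA.synchronize() variant mentioned above (assuming CUDA and BenchmarkTools are already loaded); the numbers will of course differ between machines:

julia> @btime begin
           a = CUDA.rand(64)
           findmax(a)
           CUDA.synchronize()   # block the CPU until all queued GPU work has finished
       end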

The allocations reported by BenchmarkTools are CPU allocations, and there are always some when launching kernels (we need to allocate kernel parameter buffers to pass to CUDA). To see GPU allocations, you can use CUDA.@time.
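
As a minimal sketch (exact numbers will vary per machine), CUDA.@time reports both CPU and GPU allocation statistics alongside the elapsed time:

julia> a = CUDA.rand(64);

julia> CUDA.@time findmax(a)   # prints elapsed time plus separate CPU and GPU allocation counts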

Yes, of course, but here the allocs estimate is 283 vs. 39763, where the latter case includes unsafe_free!. The memory estimate also jumps from 8.77 KiB to 2.93 MiB.

I assumed that kernel parameter buffers would be created and destroyed in both cases.

I was looking at the second and third of your measurements, which have nearly identical allocation counts. Actually, I can’t reproduce what you were seeing, and get identical counts for all three of your benchmarks. It shouldn’t matter whether you assign the array to a variable or not (and unsafe_free! shouldn’t allocate).

My test was on CUDA 1.0; retrying on master, I am getting the same allocation estimates. Maybe this needs to be looked into.