Trying to understand CUDA benchmark

I am benchmarking a kernel that is relatively simple (its structure is sketched below):

  1. it iterates over two arrays of the same size and accumulates, in thread-local variables, the number of times the two arrays agree or disagree
  2. warp-level reduction of the local variables
  3. thread-block reduction using shared memory
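Since the original kernel isn't posted, here is a minimal sketch of that three-stage structure in CUDA.jl; the kernel name, launch configuration, and the choice to count only agreements (disagreements are just `length(a)` minus that) are assumptions for illustration, not the original code:

```julia
using CUDA

function count_agreements!(out, a, b)  # hypothetical name
    i      = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    # 1. grid-stride loop: accumulate into a thread-local counter
    agree = 0f0
    while i <= length(a)
        agree += ifelse(a[i] == b[i], 1f0, 0f0)
        i += stride
    end

    # 2. warp-level reduction with shuffle intrinsics
    offset = warpsize() ÷ 2
    while offset > 0
        agree += shfl_down_sync(0xffffffff, agree, offset)
        offset ÷= 2
    end

    # 3. block-level reduction through shared memory: lane 1 of each
    #    warp deposits its partial sum, then the first warp combines them
    shared = CuStaticSharedArray(Float32, 32)
    wid, lane = fldmod1(threadIdx().x, warpsize())
    if lane == 1
        shared[wid] = agree
    end
    sync_threads()
    if wid == 1
        agree = lane <= cld(blockDim().x, warpsize()) ? shared[lane] : 0f0
        offset = warpsize() ÷ 2
        while offset > 0
            agree += shfl_down_sync(0xffffffff, agree, offset)
            offset ÷= 2
        end
        if lane == 1
            # one atomic add per block combines the per-block results
            CUDA.@atomic out[1] += agree
        end
    end
    return nothing
end

a   = CUDA.rand(Float32, 826, 512, 512)
b   = CUDA.rand(Float32, 826, 512, 512)
out = CUDA.zeros(Float32, 1)
threads = 256
blocks  = min(cld(length(a), threads), 1024)  # assumed launch configuration
@cuda threads=threads blocks=blocks count_agreements!(out, a, b)
```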

What is hard for me to understand is the source of the huge variability in kernel speed, as seen in the image below; all other non-system applications are closed. For context: benchmarking is done on an RTX 3080 under Windows 10, the arrays are of size 826×512×512 (Array{Float32, 3}), and the same arrays are used every time.

The benchmark code:

```julia
using BenchmarkTools, CUDA

BenchmarkTools.DEFAULT_PARAMETERS.samples = 500    # collect up to 500 samples
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 600    # time budget of 600 s
BenchmarkTools.DEFAULT_PARAMETERS.gcsample = true  # run GC before each sample
@benchmark CUDA.@sync kernelFunction()             # sync so the full kernel run is timed
```

[image: benchmark output showing the variability in kernel timings]

A similar function in PyTorch takes 24 ms on the same data, and I am trying to get into that range or better. Yet, as far as I can see, there is some problem with the benchmarking. What am I doing wrong?

Hard to tell without an MWE. Try running under NSight Systems with CUDA.@profile. You can additionally use NVTX.@range to mark specific parts of your application and visualize them in the timeline. If you then run multiple invocations, it'll hopefully be clear where the variability comes from. A sketch of such a setup follows.
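A minimal sketch of that setup, assuming the kernelFunction() wrapper from the question. On recent CUDA.jl versions CUDA.@profile runs an integrated profiler by default, so external=true is used here to defer to the external profiler; older versions target external profilers directly:

```julia
using CUDA, NVTX

# Launch Julia under NSight Systems, e.g.:
#   nsys profile --trace=cuda,nvtx julia --project profile_kernel.jl
CUDA.@profile external=true begin
    for _ in 1:20                          # multiple invocations to expose variability
        NVTX.@range "kernel invocation" begin
            CUDA.@sync kernelFunction()    # kernelFunction() as in the question
        end
    end
end
```

Each invocation then shows up as a named range in the NSight Systems timeline, so slow iterations can be inspected individually.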
