I am benchmarking a relatively simple kernel (a sketch of its structure follows the list):
- it iterates over two arrays of the same size and accumulates, in local variables, the number of times the arrays agree or disagree
- warp-level reduction of the local counters
- thread-block reduction using shared memory
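For concreteness, here is a minimal sketch of that structure (illustrative only: agree_kernel!, the launch configuration, and keeping just the "agree" counter are assumptions, not my actual code):

using CUDA

function agree_kernel!(out, a, b, n)
    # grid-stride loop: each thread counts element-wise agreements in a register
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    agree = Int32(0)
    while i <= n
        agree += ifelse(a[i] == b[i], Int32(1), Int32(0))
        i += stride
    end
    # warp-level reduction via shuffles
    offset = warpsize() ÷ 2
    while offset > 0
        agree += shfl_down_sync(0xffffffff, agree, offset)
        offset ÷= 2
    end
    # block-level reduction: one shared-memory slot per warp
    shared = CuStaticSharedArray(Int32, 32)
    lane = laneid()
    warp = (threadIdx().x - 1) ÷ warpsize() + 1
    lane == 1 && (shared[warp] = agree)
    sync_threads()
    # first warp reduces the per-warp partials and publishes the block total
    if warp == 1
        nwarps = cld(blockDim().x, warpsize())
        agree = lane <= nwarps ? shared[lane] : Int32(0)
        offset = warpsize() ÷ 2
        while offset > 0
            agree += shfl_down_sync(0xffffffff, agree, offset)
            offset ÷= 2
        end
        lane == 1 && (CUDA.@atomic out[1] += agree)
    end
    return nothing
end

# illustrative launch on arrays of the stated size
a = CUDA.rand(Float32, 826, 512, 512)
b = CUDA.rand(Float32, 826, 512, 512)
out = CUDA.zeros(Int32, 1)
@cuda threads=256 blocks=512 agree_kernel!(out, a, b, length(a))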
What I find hard to understand is the source of the huge variability in kernel speed seen in the image below; all other non-system applications are closed. For context: benchmarking is done on an RTX 3080 under Windows 10, the arrays are 826×512×512 Array{Float32, 3}, and the same arrays are used every time.
Code:
using BenchmarkTools, CUDA

BenchmarkTools.DEFAULT_PARAMETERS.samples = 500    # collect up to 500 samples
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 600    # time budget of 10 minutes
BenchmarkTools.DEFAULT_PARAMETERS.gcsample = true  # run GC before each sample

@benchmark CUDA.@sync kernelFunction()
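For reference, a direct timing loop with CUDA events could cross-check the BenchmarkTools numbers (a sketch, assuming kernelFunction as above):

# warm up once so compilation is not measured
kernelFunction(); CUDA.synchronize()
# CUDA.@elapsed times a single launch with CUDA events, returning seconds
times = [CUDA.@elapsed kernelFunction() for _ in 1:100]
println("min/max: ", extrema(times) .* 1000, " ms")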
A similar function in PyTorch takes 24 ms on the same data; I am trying to get into that range or better. As far as I can see there is some problem with the benchmarking - what am I doing wrong?