I am benchmarking a relatively simple kernel (a sketch of its structure follows the list):
- it iterates over two arrays of the same size and accumulates, in local variables, the number of times the arrays agree or disagree
- warp-level reduction of the local counters
- thread-block reduction using shared memory
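For concreteness, here is a minimal sketch of that structure (illustrative only: agree_kernel!, the launch configuration, and keeping just the "agree" counter are assumptions, not my actual code):

using CUDA

function agree_kernel!(out, a, b, n)
    # grid-stride loop: each thread counts element-wise agreements in a register
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    agree = Int32(0)
    while i <= n
        agree += ifelse(a[i] == b[i], Int32(1), Int32(0))
        i += stride
    end
    # warp-level reduction via shuffles
    offset = warpsize() ÷ 2
    while offset > 0
        agree += shfl_down_sync(0xffffffff, agree, offset)
        offset ÷= 2
    end
    # block-level reduction: one shared-memory slot per warp
    shared = CuStaticSharedArray(Int32, 32)
    lane = laneid()
    warp = (threadIdx().x - 1) ÷ warpsize() + 1
    lane == 1 && (shared[warp] = agree)
    sync_threads()
    # first warp reduces the per-warp partials and publishes the block total
    if warp == 1
        nwarps = cld(blockDim().x, warpsize())
        agree = lane <= nwarps ? shared[lane] : Int32(0)
        offset = warpsize() ÷ 2
        while offset > 0
            agree += shfl_down_sync(0xffffffff, agree, offset)
            offset ÷= 2
        end
        lane == 1 && (CUDA.@atomic out[1] += agree)
    end
    return nothing
end

# illustrative launch on arrays of the stated size
a = CUDA.rand(Float32, 826, 512, 512)
b = CUDA.rand(Float32, 826, 512, 512)
out = CUDA.zeros(Int32, 1)
@cuda threads=256 blocks=512 agree_kernel!(out, a, b, length(a))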
What I find hard to understand is the source of the huge variability in kernel speed seen in the image below; all other non-system applications are closed. For context: benchmarking is done on an RTX 3080 under Windows 10, the arrays are 826×512×512 Array{Float32, 3}, and the same arrays are used every time.
Code:
using BenchmarkTools, CUDA

BenchmarkTools.DEFAULT_PARAMETERS.samples = 500    # collect up to 500 samples
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 600    # time budget of 10 minutes
BenchmarkTools.DEFAULT_PARAMETERS.gcsample = true  # run GC before each sample

@benchmark CUDA.@sync kernelFunction()
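For reference, a direct timing loop with CUDA events could cross-check the BenchmarkTools numbers (a sketch, assuming kernelFunction as above):

# warm up once so compilation is not measured
kernelFunction(); CUDA.synchronize()
# CUDA.@elapsed times a single launch with CUDA events, returning seconds
times = [CUDA.@elapsed kernelFunction() for _ in 1:100]
println("min/max: ", extrema(times) .* 1000, " ms")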
A similar function in PyTorch takes 24 ms on the same data; I am trying to get into that range or better. As far as I can see there is some problem with the benchmarking - what am I doing wrong?