While benchmarking FFTW’s rfft() transform, I’ve stumbled upon an interesting behaviour that I’d like to pick the collective brain about.
The behaviour goes as follows:
```julia
bm = @benchmark rfft(A,1) setup=(A=rand(Float32, nfft, howmany))
```
shows a bimodal distribution of times:
```
julia> bm
BenchmarkTools.Trial: 3060 samples with 1 evaluation.
 Range (min … max):  822.167 μs …   2.177 ms  ┊ GC (min … max):  0.00% … 40.78%
 Time  (median):       1.013 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.151 ms ± 309.666 μs  ┊ GC (mean ± σ):  11.68% ± 16.34%

        ▁▆█▇▄▂▁                                        ▂▄▄▃▂ ▁
  ▃▁▁▁▁▁▁▁████████▇▄▄▄▄▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▄▇██████ █
  822 μs        Histogram: log(frequency) by time        1.9 ms <

 Memory estimate: 3.92 MiB, allocs estimate: 27.
```
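The two modes can also be quantified directly from the raw per-sample timings stored in the `Trial` object (`bm.times`, in nanoseconds); a small snippet, with the 1.5 ms cutoff eyeballed from the histogram above:

```julia
using Statistics

# bm.times holds the raw per-sample timings in nanoseconds.
ts = bm.times ./ 1e6                      # → milliseconds
println("slow-mode fraction: ", mean(ts .> 1.5))
println("fast mode ≈ ", median(ts[ts .< 1.5]), " ms; ",
        "slow mode ≈ ", median(ts[ts .≥ 1.5]), " ms")
```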
- However, when assessing the elapsed time of a single run, `@elapsed` consistently returns a larger figure, e.g.:
```
julia> A = rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.002405417

julia> A = rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.003051375

julia> A = rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.002866084

julia> A = rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.002906042

julia> A = rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.00287125
```
I expected `@elapsed` to return samples from the bimodal distribution above; this expectation turns out to be wrong, and I wonder why.
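One way to check this more systematically is to collect many single-shot `@elapsed` samples in a loop and compare their distribution with the `Trial` above; a minimal sketch (the `nfft` and `howmany` values below are placeholders, not the sizes from my runs):

```julia
using FFTW, Statistics

nfft, howmany = 4096, 128        # placeholder sizes

# One fresh input per sample, mirroring the REPL experiments above.
samples = map(1:200) do _
    A = rand(Float32, nfft, howmany)
    @elapsed rfft(A, 1)
end
println("min = ", 1e3 * minimum(samples), " ms, ",
        "median = ", 1e3 * median(samples), " ms")
```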
- Trying to understand what's going on inside the benchmarking, I rolled out a custom loop that inserts some sleep time between the calls to `rfft()`. The script is listed in the 1st comment to this post (a minimal sketch of the idea also follows below). The results are shown in the figure below.
The left subplot confirms that the custom loop's timings agree with those obtained with the `@benchmark` macro. The right subplot shows the impact of the “nap times” on the transform's apparent performance.
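Since the script itself lives in the first comment, here is a minimal sketch of the idea for context (the function name, sizes, and nap durations are all placeholders, not the exact script):

```julia
using FFTW, Statistics

# Time rfft() repeatedly with a configurable "nap" between calls.
# All names and sizes here are placeholders, not the posted script.
function timings_with_naps(nap_s; nfft = 4096, howmany = 128, nsamples = 100)
    A = rand(Float32, nfft, howmany)
    rfft(A, 1)                                   # warm-up / plan creation
    times = Float64[]
    for _ in 1:nsamples
        nap_s > 0 && sleep(nap_s)                # idle period between transforms
        push!(times, @elapsed rfft(A, 1))
    end
    return times
end

# Sweep nap durations to probe the dependence on call frequency.
for nap in (0.0, 0.001, 0.01, 0.1)
    println("nap = ", nap, " s → median = ",
            1e3 * median(timings_with_naps(nap)), " ms")
end
```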
The questions I'd love to get some help understanding are:
- The reason(s) for the bimodal distribution of the `rfft()` timings.
- The observed dependence of the `rfft()` timings on the temporal frequency of `rfft()` calls.
- Ultimately, how such observations could inform the design of a high-performance computing system.
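In case it helps with the first question: one diagnostic I could imagine is to factor planning and output allocation out of the measurement with a precomputed plan and in-place execution, to see whether the slow mode tracks GC/allocation or the transform itself (again with placeholder sizes):

```julia
using FFTW, LinearAlgebra, BenchmarkTools

nfft, howmany = 4096, 128        # placeholder sizes
A = rand(Float32, nfft, howmany)

# rfft(A, 1) plans and allocates on every call; a precomputed plan
# plus mul! isolates pure transform execution.
p   = plan_rfft(A, 1)
out = p * A                      # allocate the output once
@benchmark mul!($out, $p, $A)
```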