While benchmarking FFTW’s rfft() transform, I’ve stumbled upon an interesting behaviour that I’d like to pick the collective brain about.
The behaviour goes as follows:
```julia
bm = @benchmark rfft(A,1) setup=(A=rand(Float32, nfft, howmany))
```
shows a bimodal distribution of times:
```
julia> bm
BenchmarkTools.Trial: 3060 samples with 1 evaluation.
 Range (min … max):  822.167 μs …   2.177 ms  ┊ GC (min … max):  0.00% … 40.78%
 Time  (median):       1.013 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.151 ms ± 309.666 μs  ┊ GC (mean ± σ):  11.68% ± 16.34%

        ▁▆█▇▄▂▁                                        ▂▄▄▃▂ ▁
  ▃▁▁▁▁▁▁▁████████▇▄▄▄▄▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▄▇██████ █
  822 μs        Histogram: log(frequency) by time        1.9 ms <

 Memory estimate: 3.92 MiB, allocs estimate: 27.
```
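The two modes can also be quantified directly from the raw per-sample timings stored in the `Trial` object (`bm.times`, in nanoseconds); a small snippet, with the 1.5 ms cutoff eyeballed from the histogram above:

```julia
using Statistics

# bm.times holds the raw per-sample timings in nanoseconds.
ts = bm.times ./ 1e6                      # → milliseconds
println("slow-mode fraction: ", mean(ts .> 1.5))
println("fast mode ≈ ", median(ts[ts .< 1.5]), " ms; ",
        "slow mode ≈ ", median(ts[ts .≥ 1.5]), " ms")
```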
- However, when assessing the elapsed time of a single run, `@elapsed` consistently returns a larger figure, e.g.:
```
julia> A = rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.002405417

julia> A = rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.003051375

julia> A = rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.002866084

julia> A = rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.002906042

julia> A = rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.00287125
```
I expected `@elapsed` to return samples from the bimodal distribution above; this expectation turns out to be wrong, and I wonder why.
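One way to check this more systematically is to collect many single-shot `@elapsed` samples in a loop and compare their distribution with the `Trial` above; a minimal sketch (the `nfft` and `howmany` values below are placeholders, not the sizes from my runs):

```julia
using FFTW, Statistics

nfft, howmany = 4096, 128        # placeholder sizes

# One fresh input per sample, mirroring the REPL experiments above.
samples = map(1:200) do _
    A = rand(Float32, nfft, howmany)
    @elapsed rfft(A, 1)
end
println("min = ", 1e3 * minimum(samples), " ms, ",
        "median = ", 1e3 * median(samples), " ms")
```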
- Trying to understand what's going on inside the benchmarking, I rolled out a custom loop that inserts some sleep time between the calls to `rfft()`. The script is listed in the 1st comment to this post (a minimal sketch of the idea also follows below). The results are shown in the figure below.
The left subplot confirms that the custom loop's timings agree with those obtained with the `@benchmark` macro. The right subplot shows the impact of the “nap times” on the transform's apparent performance.
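Since the script itself lives in the first comment, here is a minimal sketch of the idea for context (the function name, sizes, and nap durations are all placeholders, not the exact script):

```julia
using FFTW, Statistics

# Time rfft() repeatedly with a configurable "nap" between calls.
# All names and sizes here are placeholders, not the posted script.
function timings_with_naps(nap_s; nfft = 4096, howmany = 128, nsamples = 100)
    A = rand(Float32, nfft, howmany)
    rfft(A, 1)                                   # warm-up / plan creation
    times = Float64[]
    for _ in 1:nsamples
        nap_s > 0 && sleep(nap_s)                # idle period between transforms
        push!(times, @elapsed rfft(A, 1))
    end
    return times
end

# Sweep nap durations to probe the dependence on call frequency.
for nap in (0.0, 0.001, 0.01, 0.1)
    println("nap = ", nap, " s → median = ",
            1e3 * median(timings_with_naps(nap)), " ms")
end
```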
The questions I'd love to get some help understanding are:
- The reason(s) for the bimodal distribution of the `rfft()` timings.
- The observed dependence of the `rfft()` timings on the temporal frequency of `rfft()` calls.
- Ultimately, how such observations could inform the design of a high-performance computing system.
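In case it helps with the first question: one diagnostic I could imagine is to factor planning and output allocation out of the measurement with a precomputed plan and in-place execution, to see whether the slow mode tracks GC/allocation or the transform itself (again with placeholder sizes):

```julia
using FFTW, LinearAlgebra, BenchmarkTools

nfft, howmany = 4096, 128        # placeholder sizes
A = rand(Float32, nfft, howmany)

# rfft(A, 1) plans and allocates on every call; a precomputed plan
# plus mul! isolates pure transform execution.
p   = plan_rfft(A, 1)
out = p * A                      # allocate the output once
@benchmark mul!($out, $p, $A)
```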