Curious case of FFTW performance benchmarking

While benchmarking FFTW’s rfft() transform, I’ve stumbled upon an interesting behaviour that I’d like to pick the collective brain about.
The behaviour goes as follows:

  1. bm = @benchmark rfft(A,1) setup=(A=rand(Float32, nfft, howmany)) shows a bimodal distribution of times:
julia> bm
BenchmarkTools.Trial: 3060 samples with 1 evaluation.
 Range (min … max):  822.167 μs …   2.177 ms  ┊ GC (min … max):  0.00% … 40.78%
 Time  (median):       1.013 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.151 ms ± 309.666 μs  ┊ GC (mean ± σ):  11.68% ± 16.34%

          ▁▆█▇▄▂▁                                        ▂▄▄▃▂  ▁
  ▃▁▁▁▁▁▁▁████████▇▄▄▄▄▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▄▇██████ █
  822 μs        Histogram: log(frequency) by time        1.9 ms <

 Memory estimate: 3.92 MiB, allocs estimate: 27.
  2. However, when assessing the elapsed time of a single run, @elapsed always returns a larger figure. E.g.:
julia> A = rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.002405417

julia> A = rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.003051375

julia> A = rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.002866084

julia> A = rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.002906042

julia> A = rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.00287125 

The expected behaviour would be for @elapsed to return samples from the above bimodal distribution. This expectation turns out to be wrong. I wonder why?

  3. Trying to understand what goes on inside the benchmarking, I rolled out a custom loop that incorporates some sleep time between the calls to rfft(). The script is listed in the 1st comment to this post. The results are shown in the figure below.

    The left subplot confirms that the custom loop’s timings agree with those obtained with the @benchmark macro. The right subplot shows the impact of the “nap times” on the transform’s apparent performance.

The questions I'd love some help in understanding are:

  1. The reason(s) for the bimodal distribution of the rfft() timings (one diagnostic sketch follows this list).
  2. The observed dependence of the rfft() timings on the temporal frequency of rfft() calls.
  3. Ultimately, how such observations could inform the design of a high-performance computing system.
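
One way to probe the first question would be to take planning and allocation out of the measured path. Here is a minimal sketch, assuming the slow mode comes from per-call planning and/or GC of the freshly allocated output; plan_rfft and mul! are part of the FFTW.jl / AbstractFFTs API:

using FFTW, LinearAlgebra

nfft, howmany = 1024, 1000
A = rand(Float32, nfft, howmany)

# Plan once, up front, instead of letting rfft() plan inside every call.
p = plan_rfft(A, 1)

# Preallocate the output: rfft along dim 1 of an (nfft, howmany) Float32
# array yields an (nfft ÷ 2 + 1, howmany) ComplexF32 array.
out = Matrix{ComplexF32}(undef, nfft ÷ 2 + 1, howmany)

# mul! executes the plan into `out`, so the timed call neither plans
# nor allocates, taking GC out of the picture.
@elapsed mul!(out, p, A)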

The script used to produce the plots in the OP:

using FFTW, BenchmarkTools, GLMakie

nfft, howmany = 1024, 1000  # FFT length and number of columns transformed

@info "Benchmarking FFTW.rfft()"
bm = @benchmark rfft(A,1) setup=(A=rand(Float32, nfft, howmany))

@info "Custom loop"
n = 3000; 
te = zeros(n)

for i in 1:n
  A = rand(Float32, nfft, howmany)
  te[i] = @elapsed rfft(A,1); 
end

# Repeat the loop, inserting a nap between successive rfft() calls.
sleeptimes = [0.0, 0.001, 0.01]
te_sleep = zeros(n, length(sleeptimes))
for (runix, nap) in enumerate(sleeptimes)
  @show nap
  for i in 1:n
    A = rand(Float32, nfft, howmany)
    te_sleep[i, runix] = @elapsed rfft(A, 1)
    sleep(nap)
  end
end

@info "Plotting" 
fig = Figure()
ax1 = Axis(fig[1,1]; xlabel = "Iteration", ylabel = "Elapsed [s]", title = "FFTW.rfft() timings")
ax2 = Axis(fig[1,2]; xlabel = "Iteration", ylabel = "Elapsed [s]", title = "FFTW.rfft() timings")
scatter!(ax1, bm.times*1e-9, label="@benchmark")  # bm.times is in nanoseconds
scatter!(ax1, te, label="custom loop")
scatter!(ax2, te_sleep[:, 1], label="sleep(0.0)")
scatter!(ax2, te_sleep[:, 2], label="sleep(0.001)")
scatter!(ax2, te_sleep[:, 3], label="sleep(0.01)")
axislegend(ax1)
axislegend(ax2)
fig

The script was executed in the following environment:

julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
Environment:
  JULIA_EDITOR = gvim

I can’t really reproduce this, but in any case, for single benchmarks, I’d use @btime instead of @elapsed to avoid random noise in the measurements.

julia> bm = @benchmark rfft(A,1) setup=(A=rand(Float32, nfft, howmany))
BenchmarkTools.Trial: 1861 samples with 1 evaluation.
 Range (min … max):  2.032 ms …   4.395 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.179 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.246 ms ± 215.493 μs  ┊ GC (mean ± σ):  2.51% ± 4.89%

    ▁▅▂  ▄█▇▄▁                                                 
  ▄▆████▇█████▆▅▄▃▃▃▃▃▂▂▂▂▃▃▃▃▄▄▃▃▅▅▄▄▃▃▃▃▂▂▂▂▁▂▂▁▁▁▁▁▂▂▁▁▂▁▂ ▃
  2.03 ms         Histogram: frequency by time           3 ms <

 Memory estimate: 3.92 MiB, allocs estimate: 27.

julia> @btime rfft(A,1) setup=(A=rand(Float32, nfft, howmany));
  2.032 ms (27 allocations: 3.92 MiB)

julia> @btime rfft(A,1) setup=(A=rand(Float32, nfft, howmany));
  2.058 ms (27 allocations: 3.92 MiB)

julia> A=rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.002532687

julia> A=rand(Float32, nfft, howmany); @elapsed rfft(A, 1)
0.003060502

My guess is that the variable performance is related to CPU frequency scaling. Can you enable a fixed frequency and repeat the tests?
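
A quick way to test this hypothesis: compare sleep() against a busy-wait of the same duration. A busy-wait keeps the core active, so if the timings stay low after busy-waiting but rise after sleeping, frequency scaling is implicated. A minimal sketch (busywait is an ad-hoc helper introduced here, not part of any package):

using FFTW

# Spin for t seconds without yielding to the OS, keeping the core busy.
function busywait(t)
    t0 = time_ns()
    while time_ns() - t0 < t * 1e9
    end
end

nfft, howmany = 1024, 1000
A = rand(Float32, nfft, howmany)
rfft(A, 1)  # warm-up / compilation

for pause in (sleep, busywait)
    ts = Float64[]
    for _ in 1:50
        pause(0.01)                      # idle pause vs. busy pause
        push!(ts, @elapsed rfft(A, 1))
    end
    @show pause minimum(ts)
end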

Well, that’s the reason for my confusion: I was expecting to observe noise, but instead found a genuine bias.

Good point. I'll try to find out how to disable CPU frequency scaling on my M1 Mac.
Should anyone have done this before, any advice would be appreciated.

Apparently, it is not easy to disable frequency scaling on an M1 Mac. However, monitoring the CPU frequency did confirm your guess: when the sleep() command is involved, the CPU gets throttled.
Thanks for that!
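
For anyone wanting to reproduce the monitoring: macOS ships a powermetrics utility (run with sudo) that reports CPU frequencies. A sketch of invoking it from Julia; the exact flags are an assumption on my part, so check man powermetrics:

# Assumed powermetrics flags: sample the CPU power/frequency data
# every 1000 ms, 5 times in total (requires sudo).
run(`sudo powermetrics --samplers cpu_power -i 1000 -n 5`)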
