There’s also a “warmup” phase accounted for in the BenchmarkTools code.
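If you want to see (or steer) that step yourself, here is a minimal sketch using the standard BenchmarkTools workflow; the `sum(sin, ...)` call is just a stand-in workload of my own, not anything from the thread:

```julia
using BenchmarkTools

# Build a benchmark without running it, then do the warmup explicitly,
# so JIT compilation doesn't leak into the timed samples.
b = @benchmarkable sum(sin, $(rand(1000)))
warmup(b)   # one untimed run to trigger compilation
tune!(b)    # calibrate evals per sample
run(b)
```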
Benchmark timing in general is a can of worms; I didn’t read that 7-page paper on it (most of the issues are shared with other languages):
Consecutive timing measurements can fluctuate, possibly in a correlated manner, in ways which depend on a myriad of factors such as environment temperature, workload, power availability, and network traffic, and […]
Many factors stem from OS behavior, including CPU frequency scaling [2], address space layout randomization (ASLR) [3], virtual memory management [4], [5], differences between CPU privilege levels [6], context switches due to interrupt handling [7], activity from […]
Authors have also noted the poor [statistical] power of standard techniques such as F-tests or Student t-tests for benchmark timings [10], [15], [17]–[19]. Parametric outlier detection techniques, such as the 3-sigma rule used in benchmarking software like AndroBench [20], can also fail when applied to non-i.i.d. timing measurements. There is a lack of consensus over how non-ideal timing measurements should be treated. […]
To the best of our knowledge, our work is the first benchmarking methodology that can be fully automated, is robust in its assumption of non-i.i.d. timing measurement statistics, and makes efficient use of a limited time budget.
I’ll also link, unread, this article from March 2020 (something I should probably read myself):
https://dzone.com/articles/introduction-to-benchmarking-in-julia
Yet, comparing the times above, for all statistics pre-allocating the array is slightly worse, even though we’re passing the compiler more knowledge upfront. This didn’t sit well with me, so I consulted the BenchmarkTools.jl manual […]
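My guess is that the relevant part of the manual is its note on benchmarking with non-constant globals. A minimal sketch of that gotcha (my own example, not the article’s code): a pre-allocated array held in a global can look slower unless you interpolate it with `$`, because the benchmark then also pays the global-lookup/dispatch overhead.

```julia
using BenchmarkTools

out = zeros(1000)
xs  = rand(1000)

@btime map!(sin, out, xs);     # globals resolved at runtime: extra overhead per call
@btime map!(sin, $out, $xs);   # interpolated: measures only the kernel itself
@btime map(sin, $xs);          # allocating version, for comparison
```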
If you can avoid garbage collection, using six threads here gives nearly a 10x speedup, and at the median where both single-threaded and multi-threaded trigger garbage collection you still get a 2x speedup.
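For the threaded comparison, here is a rough sketch of what such a benchmark can look like (my own illustrative kernel and thread count, not the article’s code), assuming Julia was started with multiple threads, e.g. `julia -t 6`:

```julia
using BenchmarkTools

# Allocating single-threaded map vs. a Threads.@threads loop writing into a
# freshly allocated output. With large outputs, some samples hit garbage
# collection, which is why minimum and median can tell different stories.
compute(xs) = [sin(x)^2 for x in xs]

function compute_threaded(xs)
    out = similar(xs)
    Threads.@threads for i in eachindex(xs)
        out[i] = sin(xs[i])^2
    end
    return out
end

xs = rand(10^7)
@benchmark compute($xs)
@benchmark compute_threaded($xs)
```

The minimum time typically corresponds to GC-free samples, while the median picks up samples where the allocation triggers a collection, which lines up with the min-vs-median distinction in the quote.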