I understand that BenchmarkTools’ job is a hard one, but is it expected that I frequently see several regressions while rerunning the same benchmark (without any code change)? Is there some theory that (under certain assumptions) predicts how often I should see a false positive?
Is that equivalent? Furthermore, shouldn’t there be a warmup(bench) call before second_run? It seems like a reasonable thing to do, to get rid of the JIT overhead from the code change, but the source doesn’t seem to do that.
Yes, they should be (minus JIT overhead for the second run).
For reference, a lot of the theory behind BenchmarkTools is described in this paper.
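For concreteness, a two-run comparison along these lines might look something like the sketch below (the kernel is just illustrative, not the one from your original code):

```julia
using BenchmarkTools

# Illustrative kernel; any benchmarkable expression works here.
bench = @benchmarkable sum(rand(1000))
tune!(bench)

first_run  = run(bench)
second_run = run(bench)

# `judge` compares two estimates and classifies the difference as
# :regression, :improvement, or :invariant.
judge(minimum(second_run), minimum(first_run))
```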
Furthermore, shouldn’t there be a warmup(bench) call before second_run
Yes. It’s up to the user to add this, though. We should probably add something in the docs that says “if you don’t run the tuning process on a benchmark, and you care about warming up the benchmark to get rid of JIT overhead, you should manually call warmup before running the benchmark.”
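For example, the manual workflow could look something like this (again, the kernel is just a placeholder):

```julia
using BenchmarkTools

# Sketch of the manual path: a benchmark that is run without tune!.
bench = @benchmarkable sort(rand(1000))   # illustrative kernel
warmup(bench)        # executes the benchmark once so JIT compilation happens here
results = run(bench) # timed samples no longer include compilation
```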
It seems like a reasonable thing to do, to get rid of the JIT overhead from the code change, but the source doesn’t seem to do that.
This was a purposeful decision.
For “power users” (e.g. folks who use @benchmarkable instead of @benchmark), BenchmarkTools should always respect user settings instead of making decisions for the user. This is because BenchmarkTools doesn’t have enough knowledge to decide, in the general case, whether it’s “correct” to execute the benchmark kernel “a hidden extra time” for the sake of getting rid of JIT overhead (e.g. the kernel could be non-idempotent or have side effects).
In the more naive use case (e.g. @btime and @benchmark), the tuning process takes care of JIT overhead for the users.
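For instance, something like the following needs no explicit warmup call, since the macros tune (and thereby compile) the kernel before collecting samples:

```julia
using BenchmarkTools

# The high-level macros run the tuning process automatically, so JIT
# overhead is already absorbed before any samples are recorded.
@btime sum(rand(1000));
@benchmark sum(rand(1000))
```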