Benchmarking and Pkg.test()

This post is an offshoot of the following discussion, where I reported inconsistencies in the benchmarked performance of Base.sum between machines.

To summarize the issue (which is illustrated a bit more clearly below): while benchmarking summation algorithms against Base.sum, I collected a few results from various colleagues’ machines, and noticed large variations in the benchmarked performance of Base.sum across machines.

I suggested that this might have to do with vectorization, and @mbauman provided ways to check whether the SSE/AVX/AVX2/AVX512 capabilities of the CPU explained these differences.

It turns out that these variations instead had to do with how the benchmarks were run: from a standalone invocation of julia on the command line, or via Pkg.test().

Here is a very simple example of a mostly empty package, in which the test/runtests.jl file has the following contents:

using BenchmarkTools
@btime sum($(rand(1_000)))

Then, when running the test file from the command line, we get:

> julia --project -O3 test/runtests.jl 
  84.877 ns (0 allocations: 0 bytes)

but when running from Pkg.test():

shell> julia --project --quiet -O3
julia> using Pkg; Pkg.test()
   Testing BenchSum
 Resolving package versions...
  795.624 ns (0 allocations: 0 bytes)
   Testing BenchSum tests passed 

In case anyone is wondering, the situation is almost exactly the same without the -O3 flag.

That’s nearly a 10x slow-down! Is it expected? Does anyone have an idea why this happens?


Pkg.test() runs the tests with --check-bounds=yes.
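To confirm which mode a given process is in, one option is to inspect the options struct from inside the test file. This is a minimal sketch: `Base.JLOptions()` is an internal, undocumented API, the helper name `bounds_check_mode` is made up for illustration, and the integer encoding (0 = default, 1 = yes, 2 = no) is an assumption based on current Julia releases.

```julia
# Hypothetical helper: report how bounds checking is configured in this process.
# Base.JLOptions() is internal; the 0/1/2 encoding is assumed, not documented API.
function bounds_check_mode()
    flag = Base.JLOptions().check_bounds
    if flag == 1
        return "yes (forced on, @inbounds is ignored)"
    elseif flag == 2
        return "no (forced off)"
    else
        return "default (@inbounds is honored)"
    end
end

println("check-bounds: ", bounds_check_mode())
```

Putting this at the top of test/runtests.jl should print the "yes" variant under Pkg.test() and the "default" variant in a plain run. Since forcing bounds checks on makes the compiler ignore @inbounds, it plausibly also prevents the SIMD vectorization that sum otherwise gets, which would account for the gap. Running the file with `julia --project --check-bounds=yes test/runtests.jl` should reproduce the slower timing outside of Pkg.test().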


Thanks! That explains everything.


I check generated IR in tests and realized that other flags, like --code-coverage, change the result as well. The workaround I’ve been using is to launch a subprocess without the problematic flags. Since doing this in each package is tedious, I created a helper package that does it: https://tkf.github.io/IRTest.jl/dev/ (Just FYI)
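For anyone who wants to do the subprocess trick by hand, here is a rough sketch. It is not how IRTest.jl does it; `clean_julia_cmd` is a hypothetical helper. It builds the command from the bare executable path instead of `Base.julia_cmd()`, because the latter forwards the parent's flags (including the bounds-check setting). `Base.julia_exename()` is an unexported function, so treat its use as an assumption.

```julia
# Hypothetical helper: build a command for a fresh julia process that does not
# inherit --check-bounds=yes or --code-coverage from the parent process.
function clean_julia_cmd(args::Vector{String} = String[])
    # Bare executable path; Base.julia_cmd() would forward the parent's flags.
    julia = joinpath(Sys.BINDIR, Base.julia_exename())
    flags = ["--startup-file=no"]
    # Reuse the active project (if any) so the child sees the same packages.
    project = Base.active_project()
    project === nothing || push!(flags, "--project=$(dirname(project))")
    return `$julia $flags $args`
end

# The child reports its own bounds-check setting, independent of the parent's.
run(clean_julia_cmd(["-e", "println(Base.JLOptions().check_bounds)"]))
```

In a test file, the benchmark itself could go in a separate script (say, a hypothetical test/bench.jl) and be launched with `run(clean_julia_cmd(["test/bench.jl"]))`, so that the timings are taken under default flags even when the outer process was started by Pkg.test().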
