Benchmarking and Pkg.test()

ffevotte · August 19, 2019, 3:49pm

This post is a follow-up to an earlier discussion, in which I reported inconsistencies in the benchmarked performance of Base.sum between machines.

To summarize the issue (which is illustrated a bit more clearly below): while benchmarking summation algorithms against Base.sum, I collected results from several colleagues’ machines and noticed large variations in the benchmarked performance of Base.sum from one machine to another.

I suggested that this might have to do with vectorization, and @mbauman provided ways to check whether the SSE/AVX/AVX2/AVX512 capabilities of the CPU explained these differences.

It turns out that these variations instead came from how the benchmarks were run: by invoking julia directly on the test file, or via Pkg.test().

Here is a very simple example of a mostly empty package, in which the test/runtests.jl file has the following contents:

using BenchmarkTools
@btime sum($(rand(1_000)))

Running the test file from the command line gives:

> julia --project -O3 test/runtests.jl 
  84.877 ns (0 allocations: 0 bytes)

but when running from Pkg.test():

shell> julia --project --quiet -O3
julia> using Pkg; Pkg.test()
   Testing BenchSum
 Resolving package versions...
  795.624 ns (0 allocations: 0 bytes)
   Testing BenchSum tests passed

In case anyone is wondering, the situation is almost exactly the same without the -O3 flag.

That’s nearly a 10x slow-down! Is it expected? Does anyone have an idea why this happens?

mbauman · August 19, 2019, 3:51pm

Tests run with --check-bounds=yes.
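
For anyone who wants to confirm this from inside the test file, here is a minimal sketch. It assumes that reading `Base.JLOptions()` is acceptable even though it is an internal, undocumented API; the helper name `bounds_checks_forced` is made up for illustration. The `check_bounds` field is 0 by default, 1 for `--check-bounds=yes`, and 2 for `--check-bounds=no`.

```julia
# Hypothetical helper: true when the process was started with
# --check-bounds=yes, as Pkg.test() does.
# Base.JLOptions() is internal and may change between Julia versions.
bounds_checks_forced() = Base.JLOptions().check_bounds == 1

if bounds_checks_forced()
    # @inbounds annotations are ignored in this mode, so loops such as
    # the one inside sum() may no longer vectorize.
    @warn "Running with --check-bounds=yes; timings are not representative."
end
```

Running `julia --project --check-bounds=yes test/runtests.jl` should be a way to reproduce the slow timing outside of Pkg.test().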

ffevotte · August 19, 2019, 4:03pm

Thanks! That explains everything.

tkf · August 19, 2019, 4:06pm

I check the generated IR in my tests and realized that other flags, such as --code-coverage, change the result as well. The workaround I’ve been using is to launch a subprocess without the problematic flags. Since doing this in each package is tedious, I created a helper package that does it: IRTest.jl (just FYI).
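
The subprocess workaround could look roughly like the sketch below. This is not how IRTest.jl is implemented; `run_without_test_flags` is a hypothetical name, and the sketch assumes that building the command from `Sys.BINDIR` and the internal `Base.julia_exename()` (rather than `Base.julia_cmd()`, which forwards options such as `--check-bounds` from the parent process) is enough to get default flags.

```julia
# Hypothetical helper: run `script` in a fresh Julia process that does not
# inherit --check-bounds or --code-coverage from the current one.
function run_without_test_flags(script::AbstractString;
        project::Union{Nothing,AbstractString}=Base.active_project())
    # Path to the running julia executable (Base.julia_exename is internal).
    julia = joinpath(Sys.BINDIR, Base.julia_exename())
    args = ["--startup-file=no"]
    # Reuse the active project so the script sees the same dependencies.
    project === nothing || push!(args, "--project=$project")
    # Return everything the child process wrote to stdout.
    return read(`$julia $args $script`, String)
end
```

From test/runtests.jl, one might then call `print(run_without_test_flags(joinpath(@__DIR__, "benchmarks.jl")))`, where `benchmarks.jl` is an assumed file holding the `@btime` calls.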
