This post is an offshoot of the following discussion, where I reported inconsistencies in the benchmarked performance of Base.sum between machines.
To summarize the issue (illustrated a bit more clearly below): while benchmarking summation algorithms against Base.sum, I collected a few results from various colleagues’ machines, and noticed large variations in the benchmarked performance of Base.sum across them.
I suggested that this might have to do with vectorization, and @mbauman provided ways to check whether the SSE/AVX/AVX2/AVX512 capabilities of the CPU explained these differences.
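For reference (and not necessarily the exact check suggested there), one way to see which vector instruction set the compiled kernel actually uses is to inspect the assembly that Julia generates for the call:

```julia
using InteractiveUtils  # provides @code_native outside the REPL

x = rand(1_000)
# In the printed assembly, xmm registers indicate SSE, ymm indicate
# AVX/AVX2, and zmm indicate AVX512.
@code_native sum(x)

# Sys.CPU_NAME reports the CPU model that Julia/LLVM detected.
println(Sys.CPU_NAME)
```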
It turns out that these variations instead had to do with how the benchmarks were run: from a standalone julia process, or via Pkg.test().
Here is a very simple example: a mostly empty package in which the test/runtests.jl file has the following contents:
using BenchmarkTools
@btime sum($(rand(1_000)))
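(As an aside, for @btime to be available under Pkg.test(), BenchmarkTools has to be declared as a test dependency. Something along these lines in the package’s Project.toml should do it, assuming the standard [extras]/[targets] mechanism:)

```toml
[extras]
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"

[targets]
test = ["BenchmarkTools"]
```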
Then, when running the test file directly from the command line, we get:
> julia --project -O3 test/runtests.jl
84.877 ns (0 allocations: 0 bytes)
but when running via Pkg.test():
shell> julia --project --quiet -O3
julia> using Pkg; Pkg.test()
Testing BenchSum
Resolving package versions...
795.624 ns (0 allocations: 0 bytes)
Testing BenchSum tests passed
In case anyone is wondering, the situation is almost exactly the same without the -O3 flag.
That’s nearly a 10x slow-down! Is this expected? Does anyone have an idea why it happens?
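For anyone wanting to dig into this, one check that might help narrow it down: printing Base.JLOptions() from inside test/runtests.jl shows which command-line options the test process was actually started with, e.g. whether the optimization level differs or bounds checking is forced on between the two ways of running the benchmark.

```julia
# Sketch: add this to test/runtests.jl to see the options the test
# process is really running with.
opts = Base.JLOptions()
@show opts.opt_level     # optimization level (the -O flag)
@show opts.check_bounds  # 0 = default, 1 = always on, 2 = always off
```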