A further complication is the use of a constant value as the benchmark input in `@btime sin($x)`. The ‘pseudo-interpolation’ (which is specific to these BenchmarkTools macros, by the way) makes it so that `x` is treated as a compile-time constant, so constant propagation may occur, altering the result. You can see that, for example, with:
```julia
julia> x = 2; @btime 2 * $x;
  0.033 ns (0 allocations: 0 bytes)

julia> x = Ref(2); @btime 2 * $x[];
  1.548 ns (0 allocations: 0 bytes)
```
The first time is less than a CPU clock cycle, so literally no work is being done: the result `4` is just computed at compile time. This is also the case with `sin`. Compare:
```julia
julia> x = 0.5; @btime sin($x); # 1
  3.933 ns (0 allocations: 0 bytes)

julia> @btime sin(0.5); # 2
  1.502 ns (0 allocations: 0 bytes)

julia> @btime sin(3.5); # 3
  0.033 ns (0 allocations: 0 bytes)

julia> x = Ref(3.5); @btime sin($x[]); # 4
  5.752 ns (0 allocations: 0 bytes)

julia> @btime sin(x) setup = (x = rand()); # 5
  3.839 ns (0 allocations: 0 bytes)

julia> @btime sin(rand()); # 6
  8.266 ns (0 allocations: 0 bytes)

julia> N = 100_000; x = (4 * pi) .* (rand(N) .- 0.5);

julia> b = @benchmark begin # 7
           @inbounds for i in 1:$N
               sin($x[i])
           end
       end;

julia> minimum(b).time / N # ns
13.5881
```
Seven dubious ways to benchmark `sin`.
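Incidentally, a rough way to check whether a call like `sin(3.5)` in 3 really is folded away at compile time is to hide the constant in a zero-argument wrapper (the name `g` here is arbitrary) and inspect the generated code; whether folding actually occurs depends on your Julia version and optimization settings:

```julia
julia> g() = sin(3.5);  # wrapper just for inspection

julia> @code_llvm g()   # if the body is essentially `ret double <constant>`,
                        # the entire sin call was computed at compile time
```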
I don’t fully understand the difference between 1 and 2. However:
- I think the peculiar difference between 2 and 3 is probably because the particular path taken through the `if` statements in the implementation of `sin` for `x = 3.5` is easier for the compiler to statically analyze than the one for `x = 0.5`, so the compiler is able to constant-fold / constant-propagate the whole computation.
- The `Ref` used in 4 prevents the compiler from seeing the value of `x` as a constant. However, modern CPUs are very good at branch prediction, so after a while the particular path through the `if` statements in `sin` will be predicted correctly every time, resulting in a possible speedup.
- In 5, the `setup` argument is used to produce random numbers to pass into `sin`, but only once every few evaluations, so branch prediction may still be an issue. In fact, since `rand` produces numbers between 0 and 1, it is likely that the same branch is being taken every time anyway (and probably the same one as for `sin(0.5)`, hence the similar result).
- In 6, a new random number is computed at every evaluation, but now you’re also measuring the time spent in `rand`.
- In 7, we’re iterating over 100,000 precomputed inputs between `-2 * pi` and `2 * pi`, so we’re hitting more of the branches in `sin`, without counting the time for random number generation. Increasing `N` above 100,000 doesn’t appreciably change the result, so we can also be fairly certain that the branch predictor is defeated to a certain extent. However, it could be possible for the compiler to use more SIMD instructions due to the for loop (an `@noinline` function barrier could be used to shield against this; see the sketch after this list). Also, `x` is still iterated over a number of times (multiple samples used by BenchmarkTools), and at 8 bytes × 100,000 = 800 kB it fits into L3 cache (not L2 or L1). Decreasing `N` so that `x` does fit into L2 or L1 cache can change the result significantly.
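For completeness, here is a minimal sketch of the `@noinline` function barrier mentioned in 7 (the helper name `sin_barrier` is invented here; `N` and `x` are as above). The barrier makes each call opaque, so the compiler cannot inline, SIMD-vectorize, or constant-fold across it:

```julia
julia> @noinline sin_barrier(x) = sin(x);  # opaque call: blocks inlining
                                           # and vectorization across iterations

julia> b = @benchmark begin
           @inbounds for i in 1:$N
               sin_barrier($x[i])
           end
       end;

julia> minimum(b).time / N # ns per call
```

The trade-off is that you now also pay for a function call per evaluation, so this gives an upper bound rather than a pure `sin` timing.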
See also this very interesting topic about branch prediction: PSA: Microbenchmarks remember branch history.
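As a hedged illustration of that branch-history effect (the function `sum_if` below is invented for this example; note also that the compiler may lower the branch to a branchless select, in which case the difference vanishes), a branchy loop can benchmark very differently on sorted versus shuffled input, because the predictor can learn the sorted pattern:

```julia
julia> using BenchmarkTools

julia> function sum_if(v)         # sum only the elements above 0.5
           s = 0.0
           @inbounds for x in v
               if x > 0.5         # data-dependent branch
                   s += x
               end
           end
           return s
       end;

julia> v = rand(100_000);

julia> @btime sum_if($v);          # unsorted: branch outcome is random

julia> @btime sum_if($(sort(v)));  # sorted: the predictor learns the pattern
```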
So an important question is: which scenario is closest to your use case?