A further complication is the use of a constant value as the benchmark input in `@btime sin($x)`. The ‘pseudo-interpolation’ (which, by the way, is specific to these BenchmarkTools macros) splices the value of `x` into the benchmarked expression as a compile-time constant, so constant propagation may occur, altering the result. You can see this, for example, with
julia> x = 2; @btime 2 * $x;
0.033 ns (0 allocations: 0 bytes)
julia> x = Ref(2); @btime 2 * $x[];
1.548 ns (0 allocations: 0 bytes)
The first time is less than a CPU clock cycle, so literally no work is being done: the result `4` is simply computed at compile time. The same thing can happen with `sin`. Compare:
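One way to check whether a call is really being folded away (not from the original discussion; just a sketch) is to wrap the expression in a zero-argument function and inspect the code the compiler generates:

```julia
using InteractiveUtils  # for @code_llvm outside the REPL

# Wrap the expressions in functions so their generated code can be
# inspected; `g` mirrors the `2 * $x` benchmark above.
g() = 2 * 2
h() = sin(3.5)

@code_llvm g()   # the body reduces to returning the constant 4
@code_llvm h()   # if sin(3.5) is folded, this is a single constant return
```

If the printed LLVM IR is just a `ret` of a constant, the benchmark is timing nothing at all.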
julia> x = 0.5; @btime sin($x); #1
3.933 ns (0 allocations: 0 bytes)
julia> @btime sin(0.5); #2
1.502 ns (0 allocations: 0 bytes)
julia> @btime sin(3.5); #3
0.033 ns (0 allocations: 0 bytes)
julia> x = Ref(3.5); @btime sin($x[]); #4
5.752 ns (0 allocations: 0 bytes)
julia> @btime sin(x) setup = (x = rand()); #5
3.839 ns (0 allocations: 0 bytes)
julia> @btime sin(rand()); #6
8.266 ns (0 allocations: 0 bytes)
julia> N = 100_000; x = (4 * pi) .* (rand(N) .- 0.5);
julia> b = @benchmark begin #7
@inbounds for i in 1 : $N
sin($x[i])
end
end;
julia> minimum(b).time / N # ns
13.5881
Seven dubious ways to benchmark sin.
I don’t fully understand the difference between 1 and 2. However:
- I think the peculiar difference between 2 and 3 is probably because the particular path taken through the `if` statements in the implementation of `sin` for `x = 3.5` is easier for the compiler to statically analyze than the one for `x = 0.5`, and so the compiler is able to constant-fold / constant-propagate the whole computation.
- The `Ref` used in 4 prevents the compiler from seeing the value of `x` as a constant. However, modern CPUs are very good at branch prediction, so after a while the particular path through the `if` statements in `sin` will be predicted correctly every time, resulting in a possible speedup.
- In 5, the `setup` argument is used to produce random numbers to pass into `sin`, but only every few evaluations, so branch prediction may still be an issue. In fact, since `rand` produces numbers between 0 and 1, it is likely that the same branch is being taken every time anyway (and probably the same one as for `sin(0.5)`, hence the similar result).
- In 6, a new random number is computed at every evaluation, but now you’re also measuring the time spent in `rand`.
- In 7, we’re iterating over 100,000 precomputed inputs between `-2 * pi` and `2 * pi`, so we’re hitting more of the branches in `sin`, without counting the time for random number generation. Increasing `N` above 100,000 doesn’t appreciably change the result, so we can also be fairly certain that the branch predictor is defeated to a certain extent. However, it could be possible for the compiler to use more SIMD instructions due to the `for` loop (a `@noinline` function barrier could be used to shield against this). Also, `x` is still iterated over a number of times (multiple samples used by BenchmarkTools), and it fits into L3 cache (not L2 or L1). Decreasing `N` so that `x` does fit into L2 or L1 cache can change the result significantly.
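The `@noinline` function barrier mentioned above could be sketched like this (an assumption-laden variant of example 7, not from the original post — `barrier_sin` is a hypothetical name):

```julia
using BenchmarkTools

# A function barrier: @noinline keeps the compiler from inlining
# (and hence vectorizing or folding) the sin call into the loop body.
@noinline barrier_sin(x) = sin(x)

N = 100_000
x = (4 * pi) .* (rand(N) .- 0.5)

b = @benchmark begin
    @inbounds for i in 1:$N
        barrier_sin($x[i])
    end
end
minimum(b).time / N  # ns per call, now shielded from cross-iteration SIMD
```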
See also this very interesting topic about branch prediction: PSA: Microbenchmarks remember branch history.
So an important question is: which scenario is closest to your use case?
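For instance, if your real code applies `sin` to many different inputs, something like the following may come closest (a sketch: the `sum` keeps the results live so the compiler cannot eliminate the `sin` calls as dead code, though it also adds the cost of the reduction itself):

```julia
using BenchmarkTools

N = 100_000
x = (4 * pi) .* (rand(N) .- 0.5)  # inputs covering many branches of sin

# sum(sin, x) forces every sin result to be used, defeating
# dead-code elimination while still avoiding per-call rand overhead.
b = @benchmark sum(sin, $x)
minimum(b).time / N  # approximate ns per sin call, plus reduction overhead
```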