Function calls in global scope, benchmarking, etc

tkoolen · August 14, 2019, 2:25am

A further complication is the use of a constant value as the benchmark input in @btime sin($x). The ‘pseudo-interpolation’ (which is specific to these BenchmarkTools macros by the way) makes it so that x is treated as a compile-time constant, so constant-propagation may occur, altering the result. You can see that, for example, with

julia> x = 2; @btime 2 * $x;
  0.033 ns (0 allocations: 0 bytes)

julia> x = Ref(2); @btime 2 * $x[];
  1.548 ns (0 allocations: 0 bytes)

The first time is less than a CPU clock cycle, so literally no work is being done, as the result 4 is just computed at compile time. This is also the case with sin. Compare:

julia> x = 0.5; @btime sin($x); #1
  3.933 ns (0 allocations: 0 bytes)

julia> @btime sin(0.5); #2
  1.502 ns (0 allocations: 0 bytes)

julia> @btime sin(3.5); #3
  0.033 ns (0 allocations: 0 bytes)

julia> x = Ref(3.5); @btime sin($x[]); #4
  5.752 ns (0 allocations: 0 bytes)

julia> @btime sin(x) setup = (x = rand()); #5
  3.839 ns (0 allocations: 0 bytes)

julia> @btime sin(rand()); # 6
  8.266 ns (0 allocations: 0 bytes)

julia> N = 100_000; x = (4 * pi) .* (rand(N) .- 0.5);

julia> b = @benchmark begin #7
           @inbounds for i in 1 : $N
               sin($x[i])
           end
       end;

julia> minimum(b).time / N # ns
13.5881

Seven dubious ways to benchmark sin.

I don’t fully understand the difference between 1 and 2. However:

I think the peculiar difference between 2 and 3 is probably because the particular path that’s being taken through the if statements in the implementation of sin for x = 3.5 is easier for the compiler to statically analyze than the one for x = 0.5, and so the compiler is able to constant-fold / constant-propagate the whole computation.
The Ref used in 4 prevents the compiler from seeing the value of x as a constant. However, modern CPUs are very good at branch prediction, so after a while the particular path through the if statements in sin will be predicted correctly every time, resulting in a possible speedup.
In 5, the setup argument is used to produce random numbers to pass into sin, but only every few evaluations, so branch prediction may still be an issue. In fact, since rand produces numbers between 0 and 1, it is likely that the same branch is being taken every time anyway (and probably the same one as for sin(0.5), hence the similar result).
In 6, a new random number is computed at every evaluation, but now you’re also measuring the time spent in rand.
In 7, we’re iterating over 100000 precomputed inputs between -2 * pi and 2 * pi, so we’re hitting more of the branches in sin, without counting the time for random number generation. Increasing N above 100000 doesn’t appreciably change the result, so we can also be fairly certain that the branch predictor is defeated to a certain extent. However, it could be possible for the compiler to use more SIMD instructions due to the for loop (an @noinline function barrier could be used to shield against this). Also, x is still iterated over a number of times (multiple samples used by BenchmarkTools) and it fits into L3 cache (not L2 or L1). Decreasing N so that x does fit into L2 or L1 cache can change the result significantly.

See also this very interesting topic about branch prediction: PSA: Microbenchmarks remember branch history.

So an important question is: which scenario is closest to your use case?