How to benchmark properly? Should defaults change?

What is correct here? I guess it is the one with evals=1. In that case, couldn’t it be the default option?

julia> using BenchmarkTools

julia> @btime sin(5.0)
  1.538 ns (0 allocations: 0 bytes)
-0.9589242746631385

julia> x = 5.0
5.0

julia> @btime sin($x)
  7.530 ns (0 allocations: 0 bytes)
-0.9589242746631385

julia> @btime sin($x) evals=1
  40.000 ns (0 allocations: 0 bytes)
-0.9589242746631385


@btime gives you the minimum time, so of course evals=1 will make a difference.

When things can be constant-propagated, I usually Ref-wrap the input, like this:

julia> using BenchmarkTools

julia> @btime sin(5.0) # bogus result from const prop
  0.013 ns (0 allocations: 0 bytes)
-0.9589242746631385

julia> x = 5.0
5.0

julia> @btime sin($x) # bogus result from const prop
  0.013 ns (0 allocations: 0 bytes)
-0.9589242746631385

julia> @btime sin($(Ref(x))[]) # ok
  6.476 ns (0 allocations: 0 bytes)
-0.9589242746631385

I don’t think it is simply that. With evals=1 it is running many samples of 1 evaluation each. Without it, it is running many samples of many evaluations (at least that is what I understand from the manual). My understanding is that when more than one evaluation per sample is run, we are getting some artifact associated with caching of results.
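As a rough mental model (a sketch, not BenchmarkTools’ actual implementation), I picture each sample as timing evals consecutive calls with time_ns() and dividing by evals:

function sample_time(f, x; evals = 1)
    s = 0.0
    t0 = time_ns()
    for _ in 1:evals
        s += f(x)
    end
    t1 = time_ns()
    return (t1 - t0) / evals, s   # per-evaluation time in ns (s is returned so the calls are not optimized away)
end

With that model in mind, compare the two runs: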

julia> @benchmark sin($x)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.539 ns … 42.284 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     8.425 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   9.240 ns Β±  1.905 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

      β–ˆ β–…                                                     
  β–β–‚β–†β–…β–ˆβ–ˆβ–ˆβ–β–β–β–‚β–‚β–ƒβ–β–ƒβ–β–‚β–β–‚β–β–‚β–…β–β–†β–‚β–β–„β–β–β–‚β–β–β–β–‚β–β–β–β–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  7.54 ns        Histogram: frequency by time        15.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sin($x) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  41.000 ns … 542.000 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     45.000 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   45.989 ns Β±   5.730 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

           ▁    β–†    β–ˆ                                          
  β–ƒβ–β–β–β–…β–β–β–β–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–‡β–β–β–β–β–†β–β–β–β–β–‡β–β–β–β–β–…β–β–β–β–β–…β–β–β–β–β–„β–β–β–β–β–„β–β–β–β–β–ƒ β–ƒ
  41 ns           Histogram: frequency by time           53 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.


@kristoffer.carlsson I think you still have constant propagation in your Ref example, but within samples. Might be wrong, though. What do you get if you use evals=1? (Your timings are very different from mine here.)

Here I get:

julia> @btime sin($(Ref(x))[]) 
  7.778 ns (0 allocations: 0 bytes)
-0.9589242746631385

julia> @btime sin($(Ref(x))[]) evals=1
  40.000 ns (0 allocations: 0 bytes)
-0.9589242746631385

I use nightly Julia, which probably has better constant propagation.


That is good news for Julia, but not so for BenchmarkTools, which becomes harder to understand.

I still don’t understand the difference in the results. It is even strange that, for example, using evals=1 one gets:

julia> @btime sin($x) evals=1
  40.000 ns (0 allocations: 0 bytes)
-0.9589242746631385

and with evals=2 one gets:

julia> @btime sin($x) evals=2
  24.500 ns (0 allocations: 0 bytes)
-0.9589242746631385

And by increasing evals one converges to about 7 ns, which is (?) the correct benchmark (?).

These differences are completely systematic, so they do not seem to be random fluctuations of the benchmark.

I checked the code a few months ago, and it seemed that the runtime of a single time_ns() call is always added to the measurement, resulting in an error of about 25 ns / evals on my machine.


Ah, that is one reason for the problem. Ok. So that systematic error is diluted when one uses many evaluations in each sample.
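If that is the mechanism, the reported per-evaluation time should be roughly true_time + overhead / evals, and the numbers above do work out (a rough check, assuming a true time of about 7 ns and a timer overhead of about 33 ns on this machine):

true_time = 7.0    # ns, the value @btime converges to with many evals
overhead  = 33.0   # ns, assumed cost of the extra timer call here

predicted(evals) = true_time + overhead / evals

predicted(1)     # 40.0 ns -> matches @btime ... evals=1
predicted(2)     # 23.5 ns -> close to the 24.5 ns seen with evals=2
predicted(999)   # ≈ 7.03 ns -> matches the default @btime result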

That, mixed with the constant-propagation thing, makes benchmarking a little bit confusing. Perhaps there is room for improvement in the API?

I’ve found that I get more robust results by just broadcasting over an input vector:

julia> @btime sin.(x) setup=(x=rand(1000));
  6.083 ΞΌs (1 allocation: 7.94 KiB)

It’s less ergonomic, but it does a good job guarding against overly-aggressive constant propagation.


This is also good for functions with branches. For example, sin will be faster for numbers less than pi/4.
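For example (just a sketch; the ranges are only illustrative), restricting the setup range to one side of the branch should show the difference:

using BenchmarkTools

@btime sin.(x) setup=(x=rand(1000) .* (pi/4));   # all arguments below pi/4
@btime sin.(x) setup=(x=rand(1000) .* (2pi));    # arguments spread over a full period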


Plus, it makes the SIMD implementations look good. :wink:

I normally Ref-wrap any isbits structs I’m benchmarking.

40ns is way too long. That’d be well over 100 clock cycles for most CPUs.
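At an assumed 3 GHz clock, for example:

40e-9 * 3e9   # 40 ns ≈ 120 clock cycles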

Couldn’t that be automatic?


It’s of course possible that the compiler will get smart enough to defeat this, too, eventually.

But I’d be in favor of it ref-wrapping everything by default.

The macro doesn’t have access to type information, but it could probably generate code that’s the equivalent of

rx = Ref(x)
isbits(x) ? rx[] : x   # read through the Ref for isbits values, which are the ones const prop can see through

Or maybe just ref-wrap everything by default.
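A minimal sketch of what that automatic wrapping could look like, using a hypothetical maybe_ref / deref pair (not part of BenchmarkTools):

using BenchmarkTools

# Hide a value behind a Ref only when it is isbits, i.e. exactly the
# values that constant propagation can see through:
maybe_ref(x) = isbits(x) ? Ref(x) : x

deref(r::Ref) = r[]   # unwrap per evaluation
deref(x) = x          # non-isbits values pass through untouched

x = 5.0
@btime sin(deref($(maybe_ref(x))))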


There is some disagreement on whether the minimum should be used:

Robust benchmarking in noisy environments (2016)

Minimum Times Tend to Mislead When Benchmarking (2019)
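If you want to compare the estimators yourself, the Trial returned by @benchmark exposes them (a quick sketch; median and mean come from the Statistics stdlib):

using BenchmarkTools, Statistics

t = @benchmark sin($(Ref(5.0))[])

minimum(t)   # what @btime reports
median(t)    # one of the alternatives discussed in the links above
mean(t)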