Meaning of the two minimum timings of `@btime`

I know that <1ns @btime timings mean that the compiler hoisted the method call out of the benchmark loop completely, but the result is a consistent number, ~0.034ns. What does that number mean, and where does it come from?

julia> @btime 1+1
  0.034 ns (0 allocations: 0 bytes)
2

julia> @btime 1+1+1+1
  0.034 ns (0 allocations: 0 bytes)
4

There's another consistent minimum timing, ~1.508ns, this time when I interpolate the values so the benchmark is actually doing something. I can't make any benchmark go lower than this.

julia> @btime $1+$1
  1.508 ns (0 allocations: 0 bytes)
2

julia> @btime $1+$1+$1+$1
  1.508 ns (0 allocations: 0 bytes)
4

At first I thought one of these might be a single CPU cycle, but I'm doing this on a 3.1 GHz processor, and if my Googling is right, a CPU cycle is the reciprocal of that, ~0.323ns, which is nowhere close to either figure.
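
For reference, that reciprocal works out as follows (just the arithmetic, assuming the 3.1 GHz figure):

julia> round(1 / 3.1, digits=3)   # ns per clock cycle at 3.1 GHz (1 GHz means 1 ns per cycle)
0.323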

julia> @code_native debuginfo=:none (1+1)
	.text
	.file	"+"
	.globl	"julia_+_847"                   # -- Begin function julia_+_847
	.p2align	4, 0x90
	.type	"julia_+_847",@function
"julia_+_847":                          # @"julia_+_847"
	.cfi_startproc
# %bb.0:                                # %top
	leaq	(%rdi,%rsi), %rax
	retq
.Lfunc_end0:
	.size	"julia_+_847", .Lfunc_end0-"julia_+_847"
	.cfi_endproc
                                        # -- End function
	.section	".note.GNU-stack","",@progbits

This doesn't look like 1 cycle to me.

In general, O(ns) benchmarking is not meaningful or reliable, and modern CPUs have pipelines, so "a single cycle / instruction" is not as clean-cut as you might think.

But I would normally expect something unreliable to vary randomly, whereas the 0.034ns and 1.508ns I'm seeing are consistent. I can accept that these are beyond the limits of timing precision and cannot mean anything close to "this is how long this method call took", but they must come from somewhere.

And why doesn't @btime report only meaningful timings, i.e. never go below a timing known to be the limit of meaningfulness? Is 0.001ns even a meaningful precision anywhere in the 0-10ns range?

My gut says these are probably related to the instruction-level caching your CPU does. There are more caches than just L1, L2, etc., so I'm guessing "return this constant" caches the constant itself as well as where to put it, with potential microcode caching and optimizations on top. Those are no longer transparent to BenchmarkTools.

It's just what the high-precision timer spits out. IIRC there was a PR/issue about warning when such small numbers are returned, precisely to make people aware that those sub-nanosecond timings are bogus.

I don't think this has been mentioned yet, but sub-nanosecond timings should become a thing of the past in Julia v1.8, thanks to this PR, which makes use of the new Base.donotdelete:

julia> @btime 1 + 1
  1.370 ns (0 allocations: 0 bytes)
2

So what does Base.donotdelete do? It seems odd that this is part of Base rather than something entirely in BenchmarkTools. Is it some way to tell the compiler "don't hoist"?

It prevents dead-code elimination. This makes the benchmarks somewhat less realistic in cases where the compiler is able to constant-propagate the result, but it also makes them more useful, as you get a better lower bound on the performance you can achieve without hoping the compiler will optimise everything away.
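
For illustration, here's a minimal sketch of the idea (not BenchmarkTools' actual internals; requires Julia 1.8+ for Base.donotdelete):

# Naive timing loop: Base.donotdelete marks its argument as used, so the call
# can't be removed by dead-code elimination even though the result is
# otherwise discarded.
function naive_time_ns(f, n)
    t0 = time_ns()
    for _ in 1:n
        Base.donotdelete(f())
    end
    return (time_ns() - t0) / n   # rough average ns per evaluation
end

naive_time_ns(() -> 1 + 1, 1_000_000)

Constant propagation inside f can still happen, which is exactly the trade-off described above.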

Digging into the compiler seems drastic. The way it was explained to me, hoisting meant it was effectively benchmarking () -> 1+1. But shouldn’t it be doable to figure out from @btime 1+1 that (x, y) -> x+y is the appropriate benchmark, and x, y = 1, 1 just happens to be true every iteration? Or is this not the correct explanation?

Is it?

julia> @btime ((x, y) -> x + y)(1, 1)
  0.012 ns (0 allocations: 0 bytes)
2

(julia v1.7)

The way it was explained to me, @btime ((x, y) -> x + y)(1, 1) would be benchmarking () -> ((x, y) -> x + y)(1, 1), which is compiled to () -> 2. Basically, the benchmarked expression is put in a function body, and the number of inputs you interpolate determines how many arguments that wrapper function gets.
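
To make that wrapper picture concrete (a rough sketch; the actual macro expansion differs in detail):

using BenchmarkTools

a, b = 1, 1
@btime 1 + 1      # no interpolation: a zero-argument wrapper, so the result can be constant-folded
@btime $a + $b    # interpolated values become wrapper arguments, so an addition actually runs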

Right. The reason you need Base.donotdelete is to prevent the compiler from constant-propagating away the wrapper.

I think what @Benny is saying is that maybe BenchmarkTools could automatically "extract" constants from the expression and, instead of benchmarking 1+1, which is compiled away, benchmark x+y for some values. The fact that the actual benchmark happens with the values 1, 1 should then not affect the compiled function; it's just something that happens when the benchmark calls the function.

julia> f(x, y) = x + y
f (generic function with 1 method)

julia> @code_llvm f(1,1)
;  @ REPL[3]:1 within `f`
define i64 @julia_f_1012(i64 signext %0, i64 signext %1) #0 {
top:
; ┌ @ int.jl:87 within `+`
   %2 = add i64 %1, %0
; └
  ret i64 %2
}

julia> @btime f(1, 1) # This is removed by compiler
  0.026 ns (0 allocations: 0 bytes)
2

julia> @btime f(x, y) setup=begin x=1; y=1 end # This I expected to work
  0.026 ns (0 allocations: 0 bytes)
2

julia> @btime f(x, y) setup=begin x=rand([1]); y=rand([1]) end # This works
  1.952 ns (0 allocations: 0 bytes)
2

So the first one is, as expected, compiled away. The second I expected to work, but it also seems to be compiled away. The third seems to do the trick, though; it looks like it runs at least some instructions.

So if this is now possible, could BenchmarkTools automatically do this extraction of constants, and set them up in a similar way as above?

This might be the same as (or at least similar to) what is done in 1.8?

Okay, I was under the very wrong impression that @btime makes the "wrapper function" from the input expression; it's the compiler doing its usual optimizations (constant propagation, loop-invariant code motion) on the benchmark loop, hence the utility of having a way to tell the compiler not to do that.

A less confused question: we can manually $-interpolate to get the behavior we want, so why resort to the new Base.donotdelete instead of making @btime automatically insert $ before arguments in the input expression? It just seems like the latter approach could also benefit past versions of Julia. I suppose it won't cover the Ref trick (sketched below), but that seems to be less necessary now (at least not anymore for the example in the docs' Quick Start).
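
For context, the Ref trick usually looks something like this (a sketch):

using BenchmarkTools

x, y = 1, 1
# Wrap the values in Ref and dereference inside the expression, so they stay
# opaque to constant propagation even after interpolation.
@btime $(Ref(x))[] + $(Ref(y))[]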

To get back to the original question about the fixed timings: the minimum delta that time_ns() can measure appears to be bounded (and is likely system-dependent). On my system:

julia> function f()
           x, y = time_ns(), time_ns()
           return Int(y - x)
       end
f (generic function with 1 method)

julia> first(sort(unique(f() for i in 1:1000000)), 10)
10-element Vector{Int64}:
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41

julia> @benchmark 1+1
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  0.034 ns … 0.088 ns  ┊ GC (min … max): 0.00% … 0.00%

Note that it groups execution into batches of 1000 evaluations, and 34/1000 == 0.034 (see the sketch below). It's not as obvious to me where your regularly appearing 1.508ns comes from, but IIRC when you interpolate, Julia does a function call.
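
One way to see the batching effect (a sketch; exact numbers are machine-dependent) is to force a single evaluation per sample, in which case the minimum should bottom out near the ~34ns timer granularity rather than at 34/1000:

using BenchmarkTools

t = @benchmark 1 + 1 evals=1   # one evaluation per timed sample
minimum(t).time                # minimum sample time, in nanoseconds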
