Meaning of the two minimum timings of `@btime`

I know that <1ns @btime timings mean that the compiler hoisted the method call out of the benchmark loop completely, but the result is a consistent number, ~0.034ns. What does that number mean, and where does it come from?

julia> @btime 1+1
  0.034 ns (0 allocations: 0 bytes)
2

julia> @btime 1+1+1+1
  0.034 ns (0 allocations: 0 bytes)
4

There's another consistent minimum timing, ~1.508ns, this time when I interpolate the values so the benchmark is actually doing something. I can't make any benchmark go lower than this.

julia> @btime $1+$1
  1.508 ns (0 allocations: 0 bytes)
2

julia> @btime $1+$1+$1+$1
  1.508 ns (0 allocations: 0 bytes)
4

At first I thought one of these might be a single CPU cycle, but I'm doing this on a 3.1 GHz processor, and if my Googling is right, a CPU cycle is the reciprocal of that, ~0.323ns, which is nowhere close to either figure.
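
For reference, that reciprocal works out as follows (just the arithmetic, assuming the 3.1 GHz figure):

julia> round(1 / 3.1, digits=3)   # ns per clock cycle at 3.1 GHz (1 GHz means 1 ns per cycle)
0.323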

julia> @code_native debuginfo=:none (1+1)
	.text
	.file	"+"
	.globl	"julia_+_847"                   # -- Begin function julia_+_847
	.p2align	4, 0x90
	.type	"julia_+_847",@function
"julia_+_847":                          # @"julia_+_847"
	.cfi_startproc
# %bb.0:                                # %top
	leaq	(%rdi,%rsi), %rax
	retq
.Lfunc_end0:
	.size	"julia_+_847", .Lfunc_end0-"julia_+_847"
	.cfi_endproc
                                        # -- End function
	.section	".note.GNU-stack","",@progbits

This doesn't look like 1 cycle to me.

In general, O(ns) benchmarking is not meaningful or reliable, and modern CPUs have pipelines, so "a single cycle / instruction" is not as clean-cut as you might think.

But I would normally expect something unreliable to vary randomly, whereas the 0.034ns and 1.508ns I'm seeing are consistent. I can accept that these are beyond the limits of timing precision and cannot mean anything close to "this is how long this method call took", but they must come from somewhere.

And why doesn't @btime report only meaningful timings, i.e. never go below a timing known to be the limit of meaningfulness? Is 0.001ns even a meaningful precision anywhere in the 0-10ns range?

My gut says these are probably related to the instruction-level caching your CPU does. There are more caches than just L1, L2, etc., so I'm guessing "return this constant" caches the constant itself as well as where to put it, with potential microcode caching and optimizations on top. Those are no longer transparent to BenchmarkTools.

It's just what the high-precision timer spits out. IIRC there was a PR/issue about warning when such small numbers are returned, precisely to make people aware that those sub-nanosecond timings are bogus.

I don't think this has been mentioned yet, but sub-nanosecond timings should become a thing of the past in Julia v1.8, thanks to this PR, which makes use of the new Base.donotdelete:

julia> @btime 1 + 1
  1.370 ns (0 allocations: 0 bytes)
2

So what does Base.donotdelete do? It seems odd that this is part of Base rather than something entirely in BenchmarkTools. Is it some way to tell the compiler "don't hoist"?

It prevents dead-code elimination. This makes the benchmarks somewhat less realistic in cases where the compiler is able to constant-propagate the result, but it also makes them more useful, as you get a better lower bound on the performance you can achieve without hoping the compiler will optimise everything away.
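
For illustration, here's a minimal sketch of the idea (not BenchmarkTools' actual internals; requires Julia 1.8+ for Base.donotdelete):

# Naive timing loop: Base.donotdelete marks its argument as used, so the call
# can't be removed by dead-code elimination even though the result is
# otherwise discarded.
function naive_time_ns(f, n)
    t0 = time_ns()
    for _ in 1:n
        Base.donotdelete(f())
    end
    return (time_ns() - t0) / n   # rough average ns per evaluation
end

naive_time_ns(() -> 1 + 1, 1_000_000)

Constant propagation inside f can still happen, which is exactly the trade-off described above.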

Digging into the compiler seems drastic. The way it was explained to me, hoisting meant it was effectively benchmarking () -> 1+1. But shouldn’t it be doable to figure out from @btime 1+1 that (x, y) -> x+y is the appropriate benchmark, and x, y = 1, 1 just happens to be true every iteration? Or is this not the correct explanation?

Is it?

julia> @btime ((x, y) -> x + y)(1, 1)
  0.012 ns (0 allocations: 0 bytes)
2

(julia v1.7)

The way it was explained to me, @btime ((x, y) -> x + y)(1, 1) would be benchmarking () -> ((x, y) -> x + y)(1, 1), which is compiled to () -> 2. Basically, the benchmarked expression is put in a function body, and the number of inputs you interpolate determines how many arguments that wrapper function gets.
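
To make that wrapper picture concrete (a rough sketch; the actual macro expansion differs in detail):

using BenchmarkTools

a, b = 1, 1
@btime 1 + 1      # no interpolation: a zero-argument wrapper, so the result can be constant-folded
@btime $a + $b    # interpolated values become wrapper arguments, so an addition actually runs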

Right. The reason you need Base.donotdelete is to prevent the compiler from constant-propagating away the wrapper.

I think what @Benny is saying is that maybe BenchmarkTools could automatically "extract" constants from the expression and, instead of benchmarking 1+1, which is compiled away, benchmark x+y for some values. The fact that the actual benchmark happens with the values 1, 1 should then not affect the compiled function; it's just something that happens when the benchmark calls the function.

julia> f(x, y) = x + y
f (generic function with 1 method)

julia> @code_llvm f(1,1)
;  @ REPL[3]:1 within `f`
define i64 @julia_f_1012(i64 signext %0, i64 signext %1) #0 {
top:
; ┌ @ int.jl:87 within `+`
   %2 = add i64 %1, %0
; └
  ret i64 %2
}

julia> @btime f(1, 1) # This is removed by compiler
  0.026 ns (0 allocations: 0 bytes)
2

julia> @btime f(x, y) setup=begin x=1; y=1 end # This I expected to work
  0.026 ns (0 allocations: 0 bytes)
2

julia> @btime f(x, y) setup=begin x=rand([1]); y=rand([1]) end # This works
  1.952 ns (0 allocations: 0 bytes)
2

So the first one is, as expected, compiled away. The second I expected to work, but it also seems to be compiled away. The third seems to do the trick, though; it looks like it runs at least some instructions.

So if this is now possible, could BenchmarkTools automatically do this extraction of constants, and set them up in a similar way as above?

This might be the same as (or at least similar to) what is done in 1.8?

Okay, I was under the very wrong impression that @btime makes the "wrapper function" from the input expression; it's the compiler doing its usual optimizations (constant propagation, loop-invariant code motion) on the benchmark loop, hence the utility of having a way to tell the compiler not to do that.

A less confused question: we can manually $-interpolate to get the behavior we want, so why resort to the new Base.donotdelete instead of making @btime automatically insert $ before arguments in the input expression? It just seems like the latter approach could also benefit past versions of Julia. I suppose it won't cover the Ref trick (sketched below), but that seems to be less necessary now (at least not anymore for the example in the docs' Quick Start).
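
For context, the Ref trick usually looks something like this (a sketch):

using BenchmarkTools

x, y = 1, 1
# Wrap the values in Ref and dereference inside the expression, so they stay
# opaque to constant propagation even after interpolation.
@btime $(Ref(x))[] + $(Ref(y))[]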

To get back to the original question about the fixed timings: the minimum delta that time_ns() can measure appears to be bounded (and is likely system-dependent). On my system:

julia> function f()
           x, y = time_ns(), time_ns()
           return Int(y - x)
       end
f (generic function with 1 method)

julia> first(sort(unique(f() for i in 1:1000000)), 10)
10-element Vector{Int64}:
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41

julia> @benchmark 1+1
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  0.034 ns … 0.088 ns  ┊ GC (min … max): 0.00% … 0.00%

Note that it groups execution into batches of 1000 evaluations, and 34/1000 == 0.034 (see the sketch below). It's not as obvious to me where your regularly appearing 1.508ns comes from, but IIRC when you interpolate, Julia does a function call.
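
One way to see the batching effect (a sketch; exact numbers are machine-dependent) is to force a single evaluation per sample, in which case the minimum should bottom out near the ~34ns timer granularity rather than at 34/1000:

using BenchmarkTools

t = @benchmark 1 + 1 evals=1   # one evaluation per timed sample
minimum(t).time                # minimum sample time, in nanoseconds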
