Basic Performance Difference Between 1.8.0-rc3 and 1.7.2

I have both 1.8.0-rc3 and 1.7.2 installed. I am running the simple benchmark

using BenchmarkTools; @benchmark a + b setup=(b = rand(Int64); a = rand(Int64))

which, from my understanding, should really just be measuring my CPU speed. In 1.7.2, the results are

BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.200 ns … 2.890 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.210 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.211 ns ± 0.043 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂█                                                         
  ██▂▃▁▁▁▂▁▁▁▁▁▂▁▁▁▂▁▁▁▁▂▁▂▁▂▁▂▁▂▁▁▁▁▁▁▂▁▂▁▁▁▁▁▂▁▂▁▂▁▂▁▁▁▁▂ ▂
  1.2 ns         Histogram: frequency by time        1.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

In 1.8.0-rc3, however, the results are

BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.100 ns … 31.970 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.120 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.130 ns ±  0.304 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▂           █                                    
  ▂▁▁▁▁▁▁▁▁▁▁█▂▁▁▁▁▁▁▁▁▁▁█▂▁▁▁▁▁▁▁▁▁▇▂▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▆ ▂
  2.1 ns         Histogram: frequency by time        2.15 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

There is also a difference in the output of a = rand(Int64); b = rand(Int64); @code_native a + b, with 1.7.2 returning

	.text
; ┌ @ int.jl:87 within `+`
	leaq	(%rdi,%rsi), %rax
	retq
	nopw	%cs:(%rax,%rax)
	nop
; └

and 1.8.0-rc3 returning

	.text
	.file	"+"
	.globl	"julia_+_763"                   # -- Begin function julia_+_763
	.p2align	4, 0x90
	.type	"julia_+_763",@function
"julia_+_763":                          # @"julia_+_763"
; ┌ @ int.jl:87 within `+`
	.cfi_startproc
# %bb.0:                                # %top
	leaq	(%rdi,%rsi), %rax
	retq
.Lfunc_end0:
	.size	"julia_+_763", .Lfunc_end0-"julia_+_763"
	.cfi_endproc
; └
                                        # -- End function
	.section	".note.GNU-stack","",@progbits

Why is this?

In case it’s helpful, here is the versioninfo() output for both:

Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 7 2700X Eight-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver1)

Julia Version 1.8.0-rc3
Commit 33f19bcbd25 (2022-07-13 19:10 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 2700X Eight-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver1)
  Threads: 1 on 16 virtual cores

I’m noticing the different LLVM version; is that it?

You’re just seeing extra debug info in the module dump; try this on 1.8:

julia> @code_native dump_module=false 1+2
	.text
; ┌ @ int.jl:87 within `+`
	leaq	(%rdi,%rsi), %rax
	retq
	nopw	%cs:(%rax,%rax)
	nop
; └

Timings on the order of nanoseconds are hard to trust. You might just be seeing more accurate timing in 1.8, whereas in 1.7 something may have been constant-folded away.
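One way to check whether constant folding is skewing a micro-benchmark like this (a sketch using BenchmarkTools; exact timings will vary by machine):

```julia
using BenchmarkTools

a = rand(Int64); b = rand(Int64)

# Values interpolated with $ can be treated as compile-time constants,
# so the whole expression may fold away and report a near-zero time:
@btime $a + $b

# Hiding the operands behind a Ref defeats constant folding, so the
# addition is actually executed on every evaluation:
@btime $(Ref(a))[] + $(Ref(b))[]
```

If the two reported times differ wildly, the first measurement was likely folded rather than executed.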


I think it is important to ask: Why are you trying to benchmark the addition of two single integers? Most likely, the empirical answer to the question “how long does it take to add two integers” isn’t very helpful for anything practical:

  1. The benchmark result (a couple of nanoseconds) is likely “wrong” due to a noisy environment (your computer), a potentially overly smart compiler, etc. (essentially the point that @jling already made above)
  2. Adding two single integers will (almost?) never be a bottleneck in your code, so why care in the first place?

Well, it depends on what you mean by “CPU speed”, but in general I’d say: no, not really. First of all, you wouldn’t be assessing the performance of your entire CPU but only (one unit in) a single CPU core (and you probably have several cores). Second, if anything, you’d probably want to measure floating-point performance: that’s what almost all actual computations use; integers are mostly used for indexing. And if you do want to measure FP performance, I’d at least recommend increasing the problem size, i.e. benchmarking x .+ y with x = rand(N); y = rand(N) for a reasonably large N. In any case, to actually benchmark “CPU speed” one typically does one of the following:

  1. Use matrix multiplications (e.g. peakflops() in Julia)
  2. Use the LINPACK benchmark (solving dense linear systems; a version of this is used to rank supercomputers in the top500 list)

or, if you care about a more low-level definition of “CPU speed”,

  1. Write an artificial computational kernel that performs enough FMAs to saturate your entire CPU. (This is done, for example, in likwid-bench with a hand-coded assembly kernel, and in GPUInspector.jl for the CUDA cores of NVIDIA GPUs.)
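The two higher-level measurements suggested above can be sketched as follows (N here is an arbitrary illustrative size; pick one large enough to exceed your caches):

```julia
using BenchmarkTools, LinearAlgebra

# Rough peak FP throughput (flops/sec) via a large Float64 matrix
# multiplication, which is what peakflops() does under the hood:
LinearAlgebra.peakflops()

# A vectorized FP benchmark over a reasonably large problem size,
# instead of a single scalar addition:
N = 10_000_000
x = rand(N); y = rand(N)
@benchmark $x .+ $y
```

Note that for large N this mostly measures memory bandwidth rather than arithmetic throughput, which is exactly why dense matmul (high arithmetic intensity) is the usual choice for peak-flops estimates.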