Something to keep in mind when running microbenchmarks: the sneaky CPU can remember astoundingly long branch-history patterns. If you run `@btime` on something, you generate a periodic branching pattern: the period length is the size of your benchmark input, and it is repeated very often by the benchmark loop. This is because `@btime` runs your code many times and aggregates the timings, since we don't get a clock with cycle precision.
This can spoil benchmarks and has affected some BaseBenchmarks. On Intel this is an emergent effect of the way the branch history table works (a period of 1000 means there are only 1000 possible histories of length 30; they all fit into the BHT and permit perfect prediction).
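To see why the history count is bounded by the period (this snippet is my own illustration, not part of the benchmark script below): a Bool pattern with period 1000, tiled out to any length, can only ever produce 1000 distinct length-30 branch histories, so the predictor can memorize all of them.

```julia
# Assumption for illustration: a history length of 30, as for the Intel BHT
# discussed above.
using Random
Random.seed!(42)
pat  = rand(Bool, 1000)
list = repeat(pat, 30)        # 30_000 entries, but still period 1000
# every length-30 window repeats after 1000 positions, so at most 1000 are distinct
histories = Set(list[i:i+29] for i in 1:length(list)-29)
@show length(histories)       # at most 1000
```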
The general fix is to run `@btime` on larger test vectors (but, depending on what you want to measure, not too large: if they are longer than your cache and typical application working sets, then you risk benchmarking your memory bandwidth instead).
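As a minimal sketch of that advice (the sizes here are my own illustrative choices, not tuned recommendations): benchmark the branchy kernel on a random vector much longer than any plausible predictor history, while keeping it small enough to stay in cache.

```julia
using BenchmarkTools, Random
Random.seed!(0)

# Hypothetical sizes for illustration: `too_small` can be memorized by the
# branch predictor; `big_enough` should defeat it, while 30_000 Bools
# (~30 KB) still fit comfortably in L2 cache.
too_small  = rand(Bool, 500)
big_enough = rand(Bool, 30_000)

@btime findall($too_small)    # risks looking unrealistically fast per element
@btime findall($big_enough)   # closer to real unpredictable-branch behavior
```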
```julia
julia> using Printf, BenchmarkTools

julia> function cpu_speed_ghz()
           # if available, use reported CPU speed instead of current speed (which
           # may be inaccurate if the CPU is idling)
           cpu_info = Sys.cpu_info()[1]
           m = match(r"([\d\.]+)GHz", cpu_info.model)
           ghz = m ≡ nothing ? cpu_info.speed / 1000 : parse(Float64, m.captures[1])
       end;

julia> const CPU_SPEED_GHZ = cpu_speed_ghz();

julia> const cpu_model = Sys.cpu_info()[1].model;

julia> begin
           N = 30_000
           list = fill(false, N)
           list[1:2:end] .= true
           bt0 = @belapsed findall($list)
           list .= rand(Bool, N)
           btL = @belapsed findall($list)
           time_to_cycle = 10^9 / N * CPU_SPEED_GHZ
           penalty = 2 * (btL - bt0) * time_to_cycle
           @printf("\n\n%s; branch-miss penalty: %4.1f ns = %4.1f cycles\n\n",
                   cpu_model, penalty / CPU_SPEED_GHZ, penalty)
           bt = bt0
           @printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n",
                   2, bt * 10^6, bt * time_to_cycle,
                   100 * (bt - bt0) * time_to_cycle / penalty)
           for n in [100, 500, 1000, 2000, 2500, 3000, 5000, 10_000, 30_000]
               pat = rand(Bool, n)
               for i = 1:n:N
                   list[i:(i + n - 1)] .= pat
               end
               bt = @belapsed findall($list)
               @printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n",
                       n, bt * 10^6, bt * time_to_cycle,
                       100 * (bt - bt0) * time_to_cycle / penalty)
           end
       end;
```
```
Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz; branch-miss penalty:  9.9 ns = 19.8 cycles

Period     2:   44.81 us =    2.99 cycles per idx. Miss-rate  0.00%
Period   100:   53.22 us =    3.55 cycles per idx. Miss-rate  2.83%
Period   500:   51.52 us =    3.43 cycles per idx. Miss-rate  2.26%
Period  1000:   51.37 us =    3.42 cycles per idx. Miss-rate  2.21%
Period  2000:   57.85 us =    3.86 cycles per idx. Miss-rate  4.39%
Period  2500:   88.66 us =    5.91 cycles per idx. Miss-rate 14.77%
Period  3000:  121.78 us =    8.12 cycles per idx. Miss-rate 25.93%
Period  5000:  159.28 us =   10.62 cycles per idx. Miss-rate 38.56%
Period 10000:  182.87 us =   12.19 cycles per idx. Miss-rate 46.51%
Period 30000:  192.51 us =   12.83 cycles per idx. Miss-rate 49.75%
```
This tells us that it is impossible to run microbenchmarks with 2000 completely unpredictable conditional branches on my CPU, and even 5000 branches will be quite distorted. This is not a small effect: the timings differ by a factor of more than 4.
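To spell out where those miss-rate percentages and the factor-of-4 claim come from (my own restatement of the formula in the script, plugging in the period-2 and period-30000 rows of my table):

```julia
# Cycles per index, read off the table above:
bt0 = 2.99      # period 2: fully predicted
btL = 12.83     # period 30_000: essentially random

# The script defines penalty = 2*(btL - bt0): a fully random branch is
# mispredicted ~50% of the time, so the cost of one miss is twice the
# per-index slowdown at full randomness.
penalty = 2 * (btL - bt0)            # ≈ 19.7 cycles (the table's 19.8 uses unrounded timings)
miss_rate(bt) = (bt - bt0) / penalty
@show penalty
@show miss_rate(btL)                 # 0.5 for the random case, by construction
@show btL / bt0                      # ≈ 4.3: the "factor of 4" slowdown
```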
We have so far collected data for laptop and server Broadwell and for desktop Skylake (very similar). If you have access to AMD, POWER, or ARM systems (well, regardless of what CPU you have), I'd be grateful if you could post numbers here, so that we can all figure out the correct minimal period for microbenchmarks involving unpredictable conditional branches. Cf. also https://github.com/JuliaCI/BaseBenchmarks.jl/issues/241.
I am especially concerned about Ryzen: Agner Fog suggests that “Branch prediction in Ryzen is based on perceptrons. […] Repetitive patterns are predicted very well, with no apparent limit to the length of the period.”
If we want to benchmark on Ryzen, then we must know the limit to the length of the period.