PSA: Microbenchmarks remember branch history

#1

Something to keep in mind when running microbenchmarks: the sneaky CPU can remember astoundingly long branch-history patterns. If you run @btime on something, then you generate a periodic branching pattern: the period length is the size of your benchmark input, and it is repeated very often by the benchmark loop. This is because @btime runs your code multiple times and averages the timings, since we don’t get a clock with cycle precision.

This can spoil benchmarks and has affected some BaseBenchmarks. On Intel this is an emergent effect of the way the branch history table (BHT) works: a pattern of period 1000 produces only 1000 possible histories of length 30; they all fit into the BHT and permit perfect prediction.
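To see why the period matters (my illustration, not from the original post): a pattern with period p exposes at most p distinct length-30 branch histories, no matter how long the vector is, while random data exposes close to one distinct history per position.

```julia
# Collect the set of distinct length-`len` windows (branch histories) in `bits`.
histories(bits, len) = Set(bits[i:i+len-1] for i in 1:length(bits)-len+1)

pat      = rand(Bool, 1000)
periodic = repeat(pat, 30)      # 30_000 branches, period 1000
random   = rand(Bool, 30_000)   # 30_000 branches, no period

length(histories(periodic, 30)) # at most 1000: fits in the history table
length(histories(random, 30))   # nearly 30_000: no table can cache them all
```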

The general fix is to @btime on larger test vectors (but, depending on what you want to measure, not too large: if they exceed your cache and typical application sizes, then you risk benchmarking your memory bandwidth instead).

Demo:

julia> using Printf, BenchmarkTools

julia> function cpu_speed_ghz()
           # if available, use the speed from the model string instead of the
           # current speed (which may be inaccurate if the CPU is idling)
           cpu_info = Sys.cpu_info()[1]
           m = match(r"([\d\.]+)GHz", cpu_info.model)
           m ≡ nothing ? cpu_info.speed / 1000 : parse(Float64, m.captures[1])
       end;

julia> const CPU_SPEED_GHZ = cpu_speed_ghz();

julia> const cpu_model = Sys.cpu_info()[1].model;

julia> begin 
           N=30_000
           list = fill(false, N); list[1:2:end].=true;
           bt0 = @belapsed findall($list)
           list .= rand(Bool, N)
           btL = @belapsed findall($list)
           time_to_cycle = 10^9/N * CPU_SPEED_GHZ
           penalty = 2*(btL-bt0)*time_to_cycle
           @printf("\n\n%s; branch-miss penalty: %4.1f ns = %4.1f cycles\n\n", 
           cpu_model, penalty/CPU_SPEED_GHZ , penalty)

           bt = bt0
           @printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n", 
                   2, bt*10^6, bt*time_to_cycle, 100*(bt - bt0) *time_to_cycle / penalty )
           for n=[100, 500, 1000, 2000, 2500, 3000, 5000, 10_000, 30_000]
               pat = rand(Bool, n)
               for i=1:n:N list[i:(i+n-1)].=pat end
               bt = @belapsed findall($list)
               @printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n", 
                   n, bt*10^6, bt*time_to_cycle, 100*(bt - bt0) *time_to_cycle / penalty )
           end
       end;

yielding e.g.:

Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz; branch-miss penalty:  9.9 ns = 19.8 cycles

Period     2:   44.81 us =    2.99 cycles per idx. Miss-rate  0.00%
Period   100:   53.22 us =    3.55 cycles per idx. Miss-rate  2.83%
Period   500:   51.52 us =    3.43 cycles per idx. Miss-rate  2.26%
Period  1000:   51.37 us =    3.42 cycles per idx. Miss-rate  2.21%
Period  2000:   57.85 us =    3.86 cycles per idx. Miss-rate  4.39%
Period  2500:   88.66 us =    5.91 cycles per idx. Miss-rate 14.77%
Period  3000:  121.78 us =    8.12 cycles per idx. Miss-rate 25.93%
Period  5000:  159.28 us =   10.62 cycles per idx. Miss-rate 38.56%
Period 10000:  182.87 us =   12.19 cycles per idx. Miss-rate 46.51%
Period 30000:  192.51 us =   12.83 cycles per idx. Miss-rate 49.75%

This tells us that on my CPU it is impossible to run microbenchmarks with 2000 completely unpredictable conditional branches; even 5000 branches will be quite distorted. This is not a small effect: the timings differ by a factor of four.

We have so far collected data for laptop and server Broadwell and for desktop Skylake (very similar). If you have access to AMD, Power or ARM systems (well, regardless of what CPU you have), then I’d be grateful if you could post numbers here, so that we can all figure out the correct minimal period for microbenchmarks involving unpredictable conditional branches. Cf. also https://github.com/JuliaCI/BaseBenchmarks.jl/issues/241.

I am especially concerned about Ryzen: Agner Fog suggests that “Branch prediction in Ryzen is based on perceptrons. […] Repetitive patterns are predicted very well, with no apparent limit to the length of the period.”.

If we want to benchmark on Ryzen, then we must know the limit to the length of the period.


#2

For Power9:

unknown; branch-miss penalty:  6.4 ns = 24.4 cycles

Period     2:   41.95 us =    5.31 cycles per idx. Miss-rate  0.00%
Period   100:   52.86 us =    6.70 cycles per idx. Miss-rate  5.67%
Period   500:   63.63 us =    8.06 cycles per idx. Miss-rate 11.26%
Period  1000:   89.76 us =   11.37 cycles per idx. Miss-rate 24.84%
Period  2000:  110.29 us =   13.97 cycles per idx. Miss-rate 35.50%
Period  2500:  115.06 us =   14.57 cycles per idx. Miss-rate 37.98%
Period  3000:  119.54 us =   15.14 cycles per idx. Miss-rate 40.31%
Period  5000:  129.34 us =   16.38 cycles per idx. Miss-rate 45.40%
Period 10000:  134.25 us =   17.01 cycles per idx. Miss-rate 47.95%
Period 30000:  138.21 us =   17.51 cycles per idx. Miss-rate 50.01%

#3

Insert an lfence to flush the branch predictor between benchmark samples?


#4

I think that the lfence would prevent us from executing multiple evals in parallel out of order. As far as I know, it does not flush the predictor state.

Even if we were able to flush the branch predictor, that is not something we want to benchmark: Our microbenchmark is supposed to simulate warm cache and warm predictor with fresh data.

On the other hand, an lfence should make a large difference for latency-constrained operations, say AES encryption. As far as I know, we currently benchmark throughput on this, which is what the user usually wants: the time to encrypt one block times N equals the time to encrypt N blocks (in CTR mode). Since AESENC has latency 7 and inverse throughput 1 on Broadwell, we need to encrypt seven blocks in parallel to get full throughput. Amazingly, this appears to fit into the ROB, regardless of whether we have a benchmark loop or encrypt a large buffer.
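A back-of-the-envelope sketch of the latency/throughput arithmetic above (my illustration; the latency-7, inverse-throughput-1 Broadwell figures are the ones quoted in the text):

```julia
# With k independent dependency chains, the per-block cost is bounded below by
# the inverse throughput; a single chain (e.g. CBC mode) pays the full latency.
cycles_per_block(latency, inv_throughput, chains) =
    max(latency / chains, inv_throughput)

cycles_per_block(7, 1, 1)  # 7.0: latency-bound, one chain
cycles_per_block(7, 1, 7)  # 1.0: throughput-bound, seven blocks in flight (CTR)
```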

As a user, I’m more interested in inverse throughput than in latency when benchmarking entire functions. But I’m not interested in the CPU pulling a Volkswagen on me.

PS. Cf. BranchScope, a new attack related to this (also in the wake of Spectre); only we deal with periodic patterns of length 1000+, not 10. The mitigations section of that paper looks rather thin; I still think using a test vector that overflows the pattern history table is the simplest answer.


#5

Here are results from a first-gen Ryzen CPU. I ran with Julia 1.0.2, set the cpufreq governor to performance (in the hope that this makes results more stable) and pinned execution to a single core. I’ve run the test several times, and while the absolute timings look quite stable, there’s some fluctuation in how many cycles the branch-miss penalty is. Mostly it’s 15–18 cycles, but one run gave 8.6 cycles!

AMD Ryzen 5 1600X Six-Core Processor; branch-miss penalty:  4.5 ns = 15.1 cycles

Period     2:   31.32 us =    3.46 cycles per idx. Miss-rate  0.00%
Period   100:   35.21 us =    3.89 cycles per idx. Miss-rate  2.85%
Period   500:   39.97 us =    4.41 cycles per idx. Miss-rate  6.34%
Period  1000:   45.73 us =    5.05 cycles per idx. Miss-rate 10.56%
Period  2000:   52.06 us =    5.75 cycles per idx. Miss-rate 15.20%
Period  2500:   56.00 us =    6.18 cycles per idx. Miss-rate 18.09%
Period  3000:   59.62 us =    6.58 cycles per idx. Miss-rate 20.74%
Period  5000:   67.42 us =    7.44 cycles per idx. Miss-rate 26.46%
Period 10000:   79.61 us =    8.79 cycles per idx. Miss-rate 35.39%
Period 30000:   99.54 us =   10.99 cycles per idx. Miss-rate 50.00%

#6

From an Intel i7


Intel(R) Core(TM) i7-6820HK CPU @ 2.70GHz
branch-miss penalty:  6.6 ns = 17.9 cycles

Period     2:  109.30 us =    2.46 cycles per idx. Miss-rate  0.00%
Period   100:  107.96 us =    2.43 cycles per idx. Miss-rate -0.17%
Period   500:  108.60 us =    2.44 cycles per idx. Miss-rate -0.09%
Period  1000:  108.60 us =    2.44 cycles per idx. Miss-rate -0.09%
Period  2000:  120.39 us =    2.71 cycles per idx. Miss-rate  1.39%
Period  2500:  261.97 us =    5.89 cycles per idx. Miss-rate 19.19%
Period  3000:  290.35 us =    6.53 cycles per idx. Miss-rate 22.76%
Period  5000:  410.99 us =    9.25 cycles per idx. Miss-rate 37.92%
Period 10000:  464.78 us =   10.46 cycles per idx. Miss-rate 44.68%
Period 15000:  484.31 us =   10.90 cycles per idx. Miss-rate 47.13%
Period 20000:  492.21 us =   11.07 cycles per idx. Miss-rate 48.13%
Period 30000:  500.81 us =   11.27 cycles per idx. Miss-rate 49.21%
Period 40000:  503.20 us =   11.32 cycles per idx. Miss-rate 49.51%
Period 60000:  504.95 us =   11.36 cycles per idx. Miss-rate 49.73%


Intel(R) Core(TM) i7-6820HK CPU @ 2.70GHz
branch-miss penalty:  6.6 ns = 17.9 cycles

Period     2:  109.36 us =    2.46 cycles per idx. Miss-rate  0.00%
Period   100:  109.27 us =    2.46 cycles per idx. Miss-rate -0.01%
Period   500:  108.35 us =    2.44 cycles per idx. Miss-rate -0.13%
Period  1000:  109.30 us =    2.46 cycles per idx. Miss-rate -0.01%
Period  2000:  120.08 us =    2.70 cycles per idx. Miss-rate  1.35%
Period  2500:  240.35 us =    5.41 cycles per idx. Miss-rate 16.51%
Period  3000:  308.59 us =    6.94 cycles per idx. Miss-rate 25.11%
Period  5000:  418.83 us =    9.42 cycles per idx. Miss-rate 39.01%
Period 10000:  468.64 us =   10.54 cycles per idx. Miss-rate 45.28%
Period 15000:  483.65 us =   10.88 cycles per idx. Miss-rate 47.17%
Period 20000:  495.71 us =   11.15 cycles per idx. Miss-rate 48.70%
Period 30000:  501.19 us =   11.28 cycles per idx. Miss-rate 49.39%
Period 40000:  502.69 us =   11.31 cycles per idx. Miss-rate 49.57%
Period 60000:  505.77 us =   11.38 cycles per idx. Miss-rate 49.96%


#7

Thanks a lot for that data!

Fuck. I calibrated the measured miss-penalty with period 30k in the hope that no CPU would be able to learn that. But the miss-rate is not yet saturated at period 10k, so who knows whether it is saturated at 30k. In other words, in my output the miss-rate for the 30k period is by definition 50% and the miss-rate for period 2 is by definition 0%. If 30k patterns are recognized, then the calibration is severely off, and you see this as wrong numbers for the miss-penalty.
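To spell out the calibration already embedded in the script (a sketch of the same arithmetic, not new methodology): the period-2 timing bt0 is taken as the 0% anchor and the longest-period timing btL as the 50% anchor, and every reported miss-rate is a linear interpolation between them.

```julia
# bt0: time with the period-2 pattern (assumed fully predicted, 0% miss)
# btL: time with the longest pattern (assumed unlearnable, 50% miss on random data)
# penalty = 2*(btL - bt0) per branch, so the inferred rate is:
miss_rate(bt, bt0, btL) = (bt - bt0) / (2 * (btL - bt0))

miss_rate(3.0, 3.0, 13.0)   # 0.0:  period 2, by definition
miss_rate(13.0, 3.0, 13.0)  # 0.5:  longest period, by definition
miss_rate(8.0, 3.0, 13.0)   # 0.25: halfway in time = 25% inferred miss-rate
```

If the CPU partially learns the "unlearnable" long pattern, btL shrinks, the penalty estimate shrinks with it, and every reported rate is inflated.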

And, as you mentioned, thermal throttling is a possible issue (if the first runs are in turbo and the later ones are not, then we miscalibrate the timings for the later runs). I naively thought that this would be harmless, because we exercise only a single core and the CPU can switch the load around, such that a single-threaded application can run in turbo indefinitely.

That was a bad assumption of mine; always verify what you think you know. Trusting unverified assumptions is what triggered this mess. I think the frequency scaling depends on too many things to control properly (OS, BIOS, etc.). Thanks for reporting that the numbers are unreliable.

Can you run a new test on the following? Even if the numbers in cycles end up wrong (because the reported frequency is lower than the actual frequency), the reported mispredict rate should be roughly correct and independent of frequency scaling, as long as the second run has a stable frequency (hopefully the temperature saturates in the ~60 s of the first run; if it doesn’t, we will at least be informed of the discrepancy).

julia> begin 
           for i=1:2 #reality check: run twice for thermal throttling
           N=120_000
           list = fill(false, N); list[1:2:end].=true;
           bt0 = @belapsed findall($list)
           list .= rand(Bool, N)
           btL = @belapsed findall($list)
           time_to_cycle = 10^9/N * CPU_SPEED_GHZ
           penalty = 2*(btL-bt0)*time_to_cycle
           @printf("\n\n%s; branch-miss penalty: %4.1f ns = %4.1f cycles\n\n", 
           cpu_model, penalty/CPU_SPEED_GHZ , penalty)

           bt = bt0
           @printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n", 
                   2, bt*10^6, bt*time_to_cycle, 100*(bt - bt0) *time_to_cycle / penalty )
           for n=[100, 500, 1000, 2000, 2500, 3000, 5000, 10_000, 15_000, 20_000, 30_000, 40_000, 60_000]
               pat = rand(Bool, n)
               for i=1:n:N list[i:(i+n-1)].=pat end
               bt = @belapsed findall($list) #run twice to give BPU more time to learn pattern
               bt = @belapsed findall($list)
               @printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n", 
                   n, bt*10^6, bt*time_to_cycle, 100*(bt - bt0) *time_to_cycle / penalty )
           end
           end
           end;

PS. I’d also be very interested in data from an AMD piledriver/bulldozer/excavator, if anyone here still has such a system. This is because AMD completely changed the pattern recognition approach in Ryzen: pre-Ryzen AMDs do something similar to intel, but implementation details matter.

PPS. Example output of the extended benchmark:

Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz; branch-miss penalty: 10.1 ns = 20.2 cycles

Period     2:  177.01 us =    2.95 cycles per idx. Miss-rate  0.00%
Period   100:  199.69 us =    3.33 cycles per idx. Miss-rate  1.87%
Period   500:  202.94 us =    3.38 cycles per idx. Miss-rate  2.14%
Period  1000:  203.68 us =    3.39 cycles per idx. Miss-rate  2.20%
Period  2000:  220.81 us =    3.68 cycles per idx. Miss-rate  3.61%
Period  2500:  382.82 us =    6.38 cycles per idx. Miss-rate 16.97%
Period  3000:  485.13 us =    8.09 cycles per idx. Miss-rate 25.41%
Period  5000:  644.41 us =   10.74 cycles per idx. Miss-rate 38.55%
Period 10000:  732.48 us =   12.21 cycles per idx. Miss-rate 45.81%
Period 15000:  757.50 us =   12.63 cycles per idx. Miss-rate 47.88%
Period 20000:  768.23 us =   12.80 cycles per idx. Miss-rate 48.76%
Period 30000:  777.09 us =   12.95 cycles per idx. Miss-rate 49.49%
Period 40000:  782.04 us =   13.03 cycles per idx. Miss-rate 49.90%
Period 60000:  783.86 us =   13.06 cycles per idx. Miss-rate 50.05%


Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz; branch-miss penalty: 10.1 ns = 20.2 cycles

Period     2:  178.22 us =    2.97 cycles per idx. Miss-rate  0.00%
Period   100:  194.38 us =    3.24 cycles per idx. Miss-rate  1.33%
Period   500:  201.00 us =    3.35 cycles per idx. Miss-rate  1.88%
Period  1000:  202.71 us =    3.38 cycles per idx. Miss-rate  2.02%
Period  2000:  218.43 us =    3.64 cycles per idx. Miss-rate  3.31%
Period  2500:  365.77 us =    6.10 cycles per idx. Miss-rate 15.44%
Period  3000:  489.19 us =    8.15 cycles per idx. Miss-rate 25.60%
Period  5000:  639.25 us =   10.65 cycles per idx. Miss-rate 37.96%
Period 10000:  735.78 us =   12.26 cycles per idx. Miss-rate 45.90%
Period 15000:  747.85 us =   12.46 cycles per idx. Miss-rate 46.90%
Period 20000:  767.51 us =   12.79 cycles per idx. Miss-rate 48.52%
Period 30000:  778.63 us =   12.98 cycles per idx. Miss-rate 49.43%
Period 40000:  783.14 us =   13.05 cycles per idx. Miss-rate 49.80%
Period 60000:  785.63 us =   13.09 cycles per idx. Miss-rate 50.01%

#8

Seems like 60k isn’t enough either…

AMD Ryzen 5 1600 Six-Core Processor            ; branch-miss penalty:  6.2 ns = 23.6 cycles

Period     2:  115.82 us =    3.66 cycles per idx. Miss-rate  0.00%
Period   100:  136.07 us =    4.30 cycles per idx. Miss-rate  2.71%
Period   500:  166.31 us =    5.26 cycles per idx. Miss-rate  6.75%
Period  1000:  189.80 us =    6.00 cycles per idx. Miss-rate  9.89%
Period  2000:  216.25 us =    6.84 cycles per idx. Miss-rate 13.43%
Period  2500:  228.94 us =    7.24 cycles per idx. Miss-rate 15.13%
Period  3000:  241.90 us =    7.65 cycles per idx. Miss-rate 16.86%
Period  5000:  286.18 us =    9.05 cycles per idx. Miss-rate 22.78%
Period 10000:  337.75 us =   10.68 cycles per idx. Miss-rate 29.68%
Period 15000:  366.90 us =   11.60 cycles per idx. Miss-rate 33.57%
Period 20000:  389.58 us =   12.31 cycles per idx. Miss-rate 36.61%
Period 30000:  417.12 us =   13.18 cycles per idx. Miss-rate 40.29%
Period 40000:  440.34 us =   13.92 cycles per idx. Miss-rate 43.39%
Period 60000:  458.97 us =   14.51 cycles per idx. Miss-rate 45.88%


AMD Ryzen 5 1600 Six-Core Processor            ; branch-miss penalty:  6.2 ns = 23.5 cycles

Period     2:  117.17 us =    3.70 cycles per idx. Miss-rate  0.00%
Period   100:  137.42 us =    4.34 cycles per idx. Miss-rate  2.72%
Period   500:  158.21 us =    5.00 cycles per idx. Miss-rate  5.51%
Period  1000:  185.21 us =    5.85 cycles per idx. Miss-rate  9.14%
Period  2000:  218.14 us =    6.90 cycles per idx. Miss-rate 13.56%
Period  2500:  224.08 us =    7.08 cycles per idx. Miss-rate 14.36%
Period  3000:  244.33 us =    7.72 cycles per idx. Miss-rate 17.08%
Period  5000:  280.78 us =    8.87 cycles per idx. Miss-rate 21.97%
Period 10000:  335.05 us =   10.59 cycles per idx. Miss-rate 29.26%
Period 15000:  364.47 us =   11.52 cycles per idx. Miss-rate 33.21%
Period 20000:  388.77 us =   12.29 cycles per idx. Miss-rate 36.48%
Period 30000:  420.36 us =   13.29 cycles per idx. Miss-rate 40.72%
Period 40000:  436.02 us =   13.78 cycles per idx. Miss-rate 42.82%
Period 60000:  461.13 us =   14.58 cycles per idx. Miss-rate 46.19%

#9

WTF, AMD. I honestly cannot decide whether to be impressed (I want to buy a Ryzen now) or disgusted (I distrust every published Ryzen benchmark). Both, I guess.

We might need to fix this in BenchmarkTools, at least partially. The fix there would be to use the TSC register for timing, ask users to use setup, and then force evals=1 whenever a setup is provided. But then we may need to look at the way setup and interpolation are used.
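A hedged sketch of what that usage could look like with the existing BenchmarkTools keywords (a proposal, not a shipped fix): regenerating the data in setup combined with evals=1 means each timed evaluation sees a branch pattern the predictor has never observed, at the cost of timing precision, since we can no longer average many evals per sample.

```julia
using BenchmarkTools

# Fresh random data for every sample; evals=1 so no evaluation ever re-runs
# on a pattern the branch predictor has already learned.
b = @benchmark findall(v) setup=(v = rand(Bool, 30_000)) evals=1
```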

Can you run this one size larger? We are now beyond test-vector sizes that we can ask users to provide, but I just want to know the limit on the period of patterns Ryzen is capable of learning. It would be very interesting to know what kinds of patterns Ryzen is good at, but unfortunately I couldn’t find a detailed account of its branch-prediction algorithm.

julia> begin 
           for i=1:2
           N = 480_000
           list = fill(false, N); list[1:2:end].=true;
           bt0 = @belapsed findall($list)
           list .= rand(Bool, N)
           btL = @belapsed findall($list)
           time_to_cycle = 10^9/N * CPU_SPEED_GHZ
           penalty = 2*(btL-bt0)*time_to_cycle
           @printf("\n\n%s; branch-miss penalty: %4.1f ns = %4.1f cycles\n\n", 
           cpu_model, penalty/CPU_SPEED_GHZ , penalty)

           bt = bt0
           @printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n", 
                   2, bt*10^6, bt*time_to_cycle, 100*(bt - bt0) *time_to_cycle / penalty )
           for n=[100, 500, 1000, 2000, 2500, 3000, 5000, 10_000, 15_000, 20_000, 30_000, 40_000, 60_000, 80_000, 120_000, 240_000, 480_000]
               pat = rand(Bool, n)
               for i=1:n:N list[i:(i+n-1)].=pat end
               bt = @belapsed findall($list) #run thrice to give BPU more time to learn pattern
               bt = @belapsed findall($list)
               bt = @belapsed findall($list)
               @printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n", 
                   n, bt*10^6, bt*time_to_cycle, 100*(bt - bt0) *time_to_cycle / penalty )
           end
           end           
           end;

#10
AMD Ryzen 5 1600 Six-Core Processor            ; branch-miss penalty:  6.6 ns = 25.2 cycles

Period     2:  851.52 us =    6.73 cycles per idx. Miss-rate  0.00%
Period   100:  936.83 us =    7.40 cycles per idx. Miss-rate  2.67%
Period   500: 1152.82 us =    9.11 cycles per idx. Miss-rate  9.44%
Period  1000: 1156.60 us =    9.14 cycles per idx. Miss-rate  9.56%
Period  2000: 1316.15 us =   10.40 cycles per idx. Miss-rate 14.56%
Period  2500: 1368.26 us =   10.81 cycles per idx. Miss-rate 16.19%
Period  3000: 1399.58 us =   11.06 cycles per idx. Miss-rate 17.17%
Period  5000: 1682.79 us =   13.30 cycles per idx. Miss-rate 26.04%
Period 10000: 1768.37 us =   13.97 cycles per idx. Miss-rate 28.73%
Period 15000: 1851.25 us =   14.63 cycles per idx. Miss-rate 31.32%
Period 20000: 1956.55 us =   15.46 cycles per idx. Miss-rate 34.62%
Period 30000: 2072.10 us =   16.37 cycles per idx. Miss-rate 38.24%
Period 40000: 2142.56 us =   16.93 cycles per idx. Miss-rate 40.45%
Period 60000: 2259.73 us =   17.86 cycles per idx. Miss-rate 44.12%
Period 80000: 2296.18 us =   18.14 cycles per idx. Miss-rate 45.26%
Period 120000: 2345.32 us =   18.53 cycles per idx. Miss-rate 46.80%
Period 240000: 2426.31 us =   19.17 cycles per idx. Miss-rate 49.34%
Period 480000: 2449.53 us =   19.36 cycles per idx. Miss-rate 50.07%


AMD Ryzen 5 1600 Six-Core Processor            ; branch-miss penalty:  6.6 ns = 25.1 cycles

Period     2:  864.21 us =    6.83 cycles per idx. Miss-rate  0.00%
Period   100:  948.98 us =    7.50 cycles per idx. Miss-rate  2.67%
Period   500: 1065.61 us =    8.42 cycles per idx. Miss-rate  6.34%
Period  1000: 1127.44 us =    8.91 cycles per idx. Miss-rate  8.29%
Period  2000: 1264.32 us =    9.99 cycles per idx. Miss-rate 12.60%
Period  2500: 1338.56 us =   10.58 cycles per idx. Miss-rate 14.94%
Period  3000: 1379.60 us =   10.90 cycles per idx. Miss-rate 16.24%
Period  5000: 1511.62 us =   11.94 cycles per idx. Miss-rate 20.39%
Period 10000: 1739.48 us =   13.75 cycles per idx. Miss-rate 27.57%
Period 15000: 1865.02 us =   14.74 cycles per idx. Miss-rate 31.53%
Period 20000: 1953.31 us =   15.44 cycles per idx. Miss-rate 34.31%
Period 30000: 2066.16 us =   16.33 cycles per idx. Miss-rate 37.86%
Period 40000: 2153.90 us =   17.02 cycles per idx. Miss-rate 40.63%
Period 60000: 2238.13 us =   17.69 cycles per idx. Miss-rate 43.28%
Period 80000: 2291.59 us =   18.11 cycles per idx. Miss-rate 44.97%
Period 120000: 2353.42 us =   18.60 cycles per idx. Miss-rate 46.91%
Period 240000: 2424.96 us =   19.16 cycles per idx. Miss-rate 49.17%
Period 480000: 2444.40 us =   19.32 cycles per idx. Miss-rate 49.78%

#11

We clearly see that the prefetcher does not manage to keep up (6.73 cycles at period 2 with 480k elements vs. 3.66 cycles at 120k elements), and we are beyond cache. On my Linux Broadwell, the numbers are virtually the same on 480k-element arrays as on 30k. On your system, something is apparently different; whether that is a weaker linear prefetcher in Ryzen, smaller page sizes, or something else entirely, I don’t know.

I give up. The combination of your two benchmarks (120k and 480k) is the best I know how to do.

"Repetitive patterns are predicted very well, with no apparent limit to the length of the period. Patterns of length 500 have 5% miss rate, 15k give 30%-35% misses; 60k seem to give 40%-50% miss-rate, but this is hard to test and should be considered suspect. Compared to intel, the prediction rate declines gracefully as the pattern period increases”


#12

Could you describe the relevance of the miss-rate and the significance it has for benchmarks? Correct me if I’m wrong, but my understanding of processors is that they try to predict the outcome of each branch for better performance; hence the performance cost of a branch mispredict. In the context of benchmarking, I assume that a miss-rate of 0% means the CPU is perfectly predicting the branches, and therefore the results of the benchmarked code indicate a level of performance that may not be realistic in normal use.

What I’m not clear on is what these results are showing and how to integrate that into my own benchmarking. Is there a target miss-rate (e.g. ~50%)? For more realistic benchmarks, do I need to change the miss-rate?


#13

The key is that we want benchmarks to reliably measure how expensive an arbitrary operation is when performed in situ. In order to measure that reliably, we repeatedly re-run the operation. Now, we all expected that this gives the benchmark certain advantages, e.g., that the cache is hot, but that’s OK because, in situ, the cost of warming up the cache is generally going to be paid once no matter what… and if the data are too big for the cache, well then, each benchmark loop will accurately represent that a hot cache isn’t going to help you much.

What @foobar_lv2’s data show is that the branch predictor’s history reaches far beyond what I (and, I think it’s safe to say, many) would have expected. So now we’re no longer measuring simply a hot cache, but also a hot branch predictor, and code in situ almost certainly won’t have a hot branch predictor because, well, why in the world would you do the same operation on the same data many times… unless you’re trying to benchmark it.

findall(::Vector{Bool}) is a nice operation for this because, given random data, we’d expect a 50% branch miss-rate across all repetitions of the benchmark… that is, if the branch predictor hasn’t loaded the data itself into its history due to the repeated runs.
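The data-dependent branch being trained is easy to see in a simplified findall (my sketch, not Base’s actual implementation, which is more optimized):

```julia
# Simplified findall for Vector{Bool}: one conditional branch per element,
# taken or not depending purely on the data.
function my_findall(v::Vector{Bool})
    out = Int[]
    for i in eachindex(v)
        if v[i]             # <- the data-dependent branch the predictor learns
            push!(out, i)
        end
    end
    return out
end

my_findall([true, false, true])  # [1, 3]
```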


#14

It all depends on what you try to test.

In this specific example, I tried to benchmark the speed of findall on a random Vector{Bool} of density 0.5. The problem is that a naive use of @btime on such a random vector measures something completely different:

The CPU has a very large and complex branch predictor. This branch predictor learns patterns: on Intel by caching observed short patterns, and on AMD by doing something proprietary that involves training linear classifiers (perceptrons) on the observed history.

If I now benchmark @btime findall($vec) on a relatively short vector, then this operation is run very often in order to average out clock jitter. In other words, we benchmark the performance of running findall repeatedly on the same pattern.

Sometimes this is what you want. More often, you are interested in running findall on a novel, hitherto unseen pattern. This is somewhat like a Volkswagen changing performance characteristics between usage on the road and usage in test labs, only without malice. Well, I would be very amused if popular benchmarks used for selling CPUs had a similar problem…

PS. @mbaumann was faster. I am saying exactly the same thing as he is.


#15

The use cases that come to mind for running the same operations over and over again are games (a lot of repeating logic, but not really a thing in the Julia ecosystem) and polling data/inputs from external sources at regular intervals and doing the same operations on that new data (somewhat of a use case?). Those would definitely benefit from a hot branch predictor, though it’s debatable how subtle an effect this would be in a complex scenario.

For me personally the takeaway from this isn’t “don’t trust microbenchmarks” but rather “Ryzen can find very long patterns in branching behaviour and can leverage that”. Definitely exciting!


#16

My results for the 2nd version of the test script (only posting results from the 2nd run, the 1st run is very similar):

AMD Ryzen 5 1600X Six-Core Processor; branch-miss penalty:  5.7 ns = 18.1 cycles

Period     2:  124.76 us =    3.27 cycles per idx. Miss-rate  0.00%
Period   100:  140.55 us =    3.68 cycles per idx. Miss-rate  2.29%
Period   500:  166.08 us =    4.35 cycles per idx. Miss-rate  6.00%
Period  1000:  179.18 us =    4.70 cycles per idx. Miss-rate  7.90%
Period  2000:  212.15 us =    5.56 cycles per idx. Miss-rate 12.69%
Period  2500:  227.87 us =    5.97 cycles per idx. Miss-rate 14.97%
Period  3000:  234.02 us =    6.14 cycles per idx. Miss-rate 15.86%
Period  5000:  275.01 us =    7.21 cycles per idx. Miss-rate 21.81%
Period 10000:  322.39 us =    8.45 cycles per idx. Miss-rate 28.69%
Period 15000:  357.24 us =    9.37 cycles per idx. Miss-rate 33.75%
Period 20000:  376.15 us =    9.86 cycles per idx. Miss-rate 36.49%
Period 30000:  403.15 us =   10.57 cycles per idx. Miss-rate 40.41%
Period 40000:  421.22 us =   11.04 cycles per idx. Miss-rate 43.04%
Period 60000:  440.22 us =   11.54 cycles per idx. Miss-rate 45.79%

and for the 3rd version (N=480_000, 2nd run):

AMD Ryzen 5 1600X Six-Core Processor; branch-miss penalty:  6.3 ns = 24.1 cycles

Period     2:  500.41 us =    3.97 cycles per idx. Miss-rate  0.00%
Period   100:  577.76 us =    4.58 cycles per idx. Miss-rate  2.54%
Period   500:  671.20 us =    5.32 cycles per idx. Miss-rate  5.62%
Period  1000:  748.58 us =    5.93 cycles per idx. Miss-rate  8.16%
Period  2000:  873.85 us =    6.93 cycles per idx. Miss-rate 12.28%
Period  2500:  936.31 us =    7.42 cycles per idx. Miss-rate 14.34%
Period  3000:  955.76 us =    7.57 cycles per idx. Miss-rate 14.98%
Period  5000: 1088.03 us =    8.62 cycles per idx. Miss-rate 19.33%
Period 10000: 1313.59 us =   10.41 cycles per idx. Miss-rate 26.75%
Period 15000: 1426.15 us =   11.30 cycles per idx. Miss-rate 30.45%
Period 20000: 1512.96 us =   11.99 cycles per idx. Miss-rate 33.31%
Period 30000: 1616.78 us =   12.81 cycles per idx. Miss-rate 36.72%
Period 40000: 1682.83 us =   13.34 cycles per idx. Miss-rate 38.90%
Period 60000: 1772.00 us =   14.04 cycles per idx. Miss-rate 41.83%
Period 80000: 1852.40 us =   14.68 cycles per idx. Miss-rate 44.47%
Period 120000: 1902.91 us =   15.08 cycles per idx. Miss-rate 46.14%
Period 240000: 1989.35 us =   15.77 cycles per idx. Miss-rate 48.98%
Period 480000: 2007.97 us =   15.91 cycles per idx. Miss-rate 49.59%

#17

Even if you’re running the same operation repeatedly, generally your data change (otherwise, why would you do it again?). Amusingly, learning branch predictions to this extent could actually make real-life code (that is, the same operation with new data) perform worse than it would with completely random branch prediction. Classic overfitting.


#18

Of course, but what I’m saying is that in a real-world scenario, the data generally aren’t random, right? Those patterns in the data influence the branching behaviour, and if Ryzen is able to learn them, that generally seems favourable for performance, even if it’s bad for getting an accurate measurement in the general case with random data.


#19

I find this discussion fascinating and scary at the same time.

Is there a practical recommendation for how I should benchmark in Julia to mitigate this effect? E.g., use 10–100 randomly generated datasets and cycle through them?


#20

You are still misunderstanding. The problem is not that the predictor learns from the dataset we feed to the benchmark during a single sample run; it is that it remembers the pattern from the previous benchmark runs. That is definitely not what we want to measure.