PSA: Microbenchmarks remember branch history

WTF AMD. I honestly cannot decide whether to be impressed (want to buy a Ryzen now) or disgusted (distrust every published Ryzen benchmark). Both, I guess.

We might need to fix this in BenchmarkTools, at least partially. The fix there would be to use the TSC register for timing, ask users to use setup and then fix evals=1 if a setup is provided. But then we may need to look at the way setup and interpolation are used.

Can you run this once larger? We are now beyond testvector sizes that we can ask users to provide, but I just want to know the limit on the period of patterns Ryzen is capable of learning. It would be very interesting to know what kind of patterns Ryzen is good at, but unfortunately I couldn’t find a detailed account of its branch prediction algorithm.

julia> begin 
           for i=1:2
           N = 480_000
           list = fill(false, N); list[1:2:end].=true;
           bt0 = @belapsed findall($list)
           list .= rand(Bool, N)
           btL = @belapsed findall($list)
           time_to_cycle = 10^9/N * CPU_SPEED_GHZ
           penalty = 2*(btL-bt0)*time_to_cycle
           @printf("\n\n%s; branch-miss penalty: %4.1f ns = %4.1f cycles\n\n", 
           cpu_model, penalty/CPU_SPEED_GHZ , penalty)

           bt = bt0
           @printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n", 
                   2, bt*10^6, bt*time_to_cycle, 100*(bt - bt0) *time_to_cycle / penalty )
           for n=[100, 500, 1000, 2000, 2500, 3000, 5000, 10_000, 15_000, 20_000, 30_000, 40_000, 60_000, 80_000, 120_000, 240_000, 480_000]
               pat = rand(Bool, n)
               for i=1:n:N list[i:(i+n-1)].=pat end
               bt = @belapsed findall($list) #run thrice to give BPU more time to learn pattern
               bt = @belapsed findall($list)
               bt = @belapsed findall($list)
               @printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n", 
                   n, bt*10^6, bt*time_to_cycle, 100*(bt - bt0) *time_to_cycle / penalty )
           end
           end           
           end;
1 Like