PSA: Microbenchmarks remember branch history

Yes, I’m aware that this is bad for benchmarking :slight_smile: All I’m saying is that, at the same time, this can be a good thing for real data, where patterns between runs can, in general, exist.

Apologies if I’ve expressed that poorly in my previous post.

1 Like

Now I’m wondering if Ryzen’s branch-predicting NN is actually learning patterns in Julia’s pseudo-RNG…

2 Likes

I’m going to go out on a limb here and say this result is only really bad if you don’t know whether there is a pattern in (or have no metadata about) the data you feed to the function you want to benchmark, e.g. when developing core functions with unknown usage, like very general-purpose functions (matrix multiplication, findall and the like). If you do know there is a pattern, or have some metadata about what data you can expect to receive, it would be best to replicate those patterns in the benchmark data to get an accurate reading, right? Of course, one should be careful not to over-optimize a function for that pattern… It should also be disclosed if such a specialized benchmark is run and presented.

Is there a practical recommendation about how I should benchmark in Julia to mitigate this effect?

I don’t know. Increasing your dataset to contain 5000+ points is probably good. Otherwise… we’ll need to see. I guess this thread is a good place to have this discussion. If https://github.com/JuliaCI/BenchmarkTools.jl/pull/92 gets merged, we should get the timing precision good enough to run with evals/sample = 1; then the proper way will be to provide a setup=... parameter that re-randomizes your data.
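
To make that concrete, here is a minimal sketch of the setup-based approach, assuming the PR lands and evals/sample = 1 becomes practical (count_positive is just a hypothetical stand-in for whatever branchy function you care about):

using BenchmarkTools

# Hypothetical toy function with data-dependent branches.
count_positive(x) = count(v -> v > 0, x)

# setup runs before every sample; with evals=1 each sample then sees
# freshly randomized data, so the predictor cannot learn one fixed pattern.
@benchmark count_positive(x) setup=(x = randn(10_000)) evals=1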

As a second example: if your benchmarked function involves a binary search (e.g. searchsortedfirst), then you have the same problem (the branching pattern encodes the position of your key). In theory, small sorts should have the same problem. And even if you were clever and benchmarked a broadcast of the binary search over a small vector, you can still be screwed, because the vector is too small and nobody in their right mind expects the branch predictor to remember insanely long patterns. In theory, branch prediction and cache misses should combine into really strong effects: the transient execution (the thing that was speculated while waiting for the branch to resolve, which can take very long when a cache miss is involved) can request and evict cache lines (leading to Spectre hilarity), which should strongly amplify the timing effect of branch misses. I could not reproduce this, though.
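
For instance (a hedged sketch, not from the original post): benchmarking searchsortedfirst with one fixed key lets the predictor memorize the whole comparison sequence, whereas re-drawing the key per sample does not.

using BenchmarkTools

v = sort(rand(10_000))

# Fixed key: the binary search takes the exact same left/right branches
# every evaluation, so the predictor can memorize them.
@benchmark searchsortedfirst($v, 0.42)

# Fresh key per sample: the branch pattern changes every time, which is
# closer to an unpredictable real workload.
@benchmark searchsortedfirst($v, k) setup=(k = rand()) evals=1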

I guess we will see what the fallout is. I think we’ll start by reviewing the BaseBenchmarks and then hope that it will become clearer what to do / recommend in BenchmarkTools.

1 Like

I’d guess there’s at least one real-world example where learning long-ish patterns can be useful: for loops. Yes, the data you’re processing presumably differs from run to run, but the loop termination itself is a branch whose pattern repeats whenever the loop length repeats. I have a hard time believing that one would see much benefit from learning a single loop of length 2000, but of course if you have a pattern where you run a loop of length 2, then 3, then 4, then 5, then… (think of, e.g., Cholesky factorization), the overall period can grow to be longish and yet each loop is short. If you factor a whole bunch of 15x15 matrices over and over, it seems reasonable to me that accurate branch prediction could have a major positive benefit.
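
As a hedged sketch of that kind of workload (not anything benchmarked in this thread):

using LinearAlgebra

# Factor a pile of small SPD matrices over and over.  Each 15x15
# factorization runs inner loops of length 1, 2, ..., 15, so the branch
# pattern repeats exactly from call to call: a longish but periodic
# history that a modern predictor can plausibly learn.
As = [Symmetric(rand(15, 15) + 15I) for _ in 1:10_000]
facs = map(cholesky, As)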

I’m not arguing that this isn’t misleading for many of the benchmarks (it is); I just want to point out that overall it’s a good thing CPUs do this.

5 Likes

I think there are tons of real-world examples where learning long patterns pays off. For one thing, very few people write perfect code that never repeats work.

I saw a similar pattern:

AMD Ryzen Threadripper 1950X 16-Core Processor; branch-miss penalty:  6.2 ns = 22.5 cycles

Period     2:  361.70 us =    2.72 cycles per idx. Miss-rate  0.00%
Period   100:  477.46 us =    3.60 cycles per idx. Miss-rate  3.87%
Period   500:  599.56 us =    4.52 cycles per idx. Miss-rate  7.95%
Period  1000:  661.47 us =    4.98 cycles per idx. Miss-rate 10.02%
Period  2000:  726.98 us =    5.48 cycles per idx. Miss-rate 12.20%
Period  2500:  782.91 us =    5.90 cycles per idx. Miss-rate 14.07%
Period  3000:  916.26 us =    6.90 cycles per idx. Miss-rate 18.53%
Period  5000: 1000.20 us =    7.53 cycles per idx. Miss-rate 21.33%
Period 10000: 1245.99 us =    9.39 cycles per idx. Miss-rate 29.55%
Period 15000: 1309.33 us =    9.86 cycles per idx. Miss-rate 31.66%
Period 20000: 1391.70 us =   10.48 cycles per idx. Miss-rate 34.41%
Period 30000: 1497.16 us =   11.28 cycles per idx. Miss-rate 37.94%
Period 40000: 1587.96 us =   11.96 cycles per idx. Miss-rate 40.97%
Period 60000: 1655.30 us =   12.47 cycles per idx. Miss-rate 43.22%
Period 80000: 1739.87 us =   13.11 cycles per idx. Miss-rate 46.05%
Period 120000: 1847.58 us =   13.92 cycles per idx. Miss-rate 49.65%
Period 240000: 1895.98 us =   14.28 cycles per idx. Miss-rate 51.26%
Period 480000: 1860.58 us =   14.02 cycles per idx. Miss-rate 50.08%


AMD Ryzen Threadripper 1950X 16-Core Processor; branch-miss penalty:  6.2 ns = 22.5 cycles

Period     2:  359.88 us =    2.71 cycles per idx. Miss-rate  0.00%
Period   100:  466.77 us =    3.52 cycles per idx. Miss-rate  3.58%
Period   500:  564.70 us =    4.25 cycles per idx. Miss-rate  6.85%
Period  1000:  639.48 us =    4.82 cycles per idx. Miss-rate  9.35%
Period  2000:  731.45 us =    5.51 cycles per idx. Miss-rate 12.43%
Period  2500:  849.87 us =    6.40 cycles per idx. Miss-rate 16.39%
Period  3000:  876.82 us =    6.61 cycles per idx. Miss-rate 17.29%
Period  5000: 1005.03 us =    7.57 cycles per idx. Miss-rate 21.58%
Period 10000: 1232.19 us =    9.28 cycles per idx. Miss-rate 29.18%
Period 15000: 1310.84 us =    9.87 cycles per idx. Miss-rate 31.81%
Period 20000: 1448.19 us =   10.91 cycles per idx. Miss-rate 36.40%
Period 30000: 1493.49 us =   11.25 cycles per idx. Miss-rate 37.92%
Period 40000: 1634.33 us =   12.31 cycles per idx. Miss-rate 42.63%
Period 60000: 1657.22 us =   12.48 cycles per idx. Miss-rate 43.40%
Period 80000: 1693.99 us =   12.76 cycles per idx. Miss-rate 44.63%
Period 120000: 1754.76 us =   13.22 cycles per idx. Miss-rate 46.66%
Period 240000: 1820.66 us =   13.72 cycles per idx. Miss-rate 48.86%
Period 480000: 1825.42 us =   13.75 cycles per idx. Miss-rate 49.02%

FWIW, I see equivalent behavior with the newer revision of Ryzen as well (second version, N=480_000):

AMD Ryzen 5 2600X Six-Core Processor; branch-miss penalty:  6.4 ns = 13.4 cycles

Period     2:  348.67 us =    1.53 cycles per idx. Miss-rate  0.00%
Period   100:  473.76 us =    2.08 cycles per idx. Miss-rate  4.10%
Period   500:  575.19 us =    2.53 cycles per idx. Miss-rate  7.42%
Period  1000:  671.69 us =    2.95 cycles per idx. Miss-rate 10.58%
Period  2000:  839.49 us =    3.69 cycles per idx. Miss-rate 16.08%
Period  2500:  812.51 us =    3.57 cycles per idx. Miss-rate 15.20%
Period  3000:  856.35 us =    3.76 cycles per idx. Miss-rate 16.63%
Period  5000:  987.01 us =    4.34 cycles per idx. Miss-rate 20.92%
Period 10000: 1206.00 us =    5.30 cycles per idx. Miss-rate 28.09%
Period 15000: 1380.67 us =    6.07 cycles per idx. Miss-rate 33.81%
Period 20000: 1443.07 us =    6.34 cycles per idx. Miss-rate 35.86%
Period 30000: 1511.92 us =    6.65 cycles per idx. Miss-rate 38.11%
Period 40000: 1568.28 us =    6.89 cycles per idx. Miss-rate 39.96%
Period 60000: 1741.56 us =    7.66 cycles per idx. Miss-rate 45.64%

Btw, slightly off-topic: could some Ryzen / Power / ARM owner post the output of https://github.com/maxbennedich/julia-scripts/blob/master/findall_benchmark.jl on 1.0.0, 1.0.1, or 1.0.2? Master already has the relevant PR merged.

The reason is that the new BitArray logical indexing / findall code is optimized pretty hard for Intel timings of bit-manipulation instructions (Intel: fast BLSR, slow TZCNT/LZCNT; AMD: medium BLSR/TZCNT, fast LZCNT). Nobody on GitHub ran tests on a non-Intel CPU, so I am kinda curious how bad it is.
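
For context, the pattern in question is roughly the classic set-bit iteration loop (this is a sketch of the general idea, not the actual Base implementation): trailing_zeros compiles to TZCNT and c & (c - 1) to BLSR, which is why the relative timings of those instructions matter.

# Sketch only: walk the set bits of each 64-bit chunk by repeatedly
# taking the lowest set bit (trailing_zeros / TZCNT) and clearing it
# (c & (c - 1) / BLSR), pushing the corresponding 1-based index.
function findall_chunks(chunks::Vector{UInt64})
    out = Int[]
    for (i, c) in enumerate(chunks)
        base = (i - 1) << 6
        while c != 0
            push!(out, base + trailing_zeros(c) + 1)
            c &= c - 1
        end
    end
    return out
end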

If I am reading the instruction tables right, this should still be OK. But if the new code happens to perform really badly on Ryzen/Power/ARM, then maybe someone should think about possible alternatives (e.g. use AMD’s fast LZCNT to get fast reverse-order iteration and fill the output from the other end).

The output will be polluted by this remembered-branch-history issue, but at least the timings of the new non-branchy findall(V::BitVector) should be good enough to see whether there is a problem.

1 Like

This is on a Ryzen 5 1600:

findall_benchmark.jl
julia> include("findall.jl")
CPU speed: 3.79 GHz

       size      | selected |  old time  |   per idx  |  cycles |  new time  |   per idx  |  cycles | speedup
-------------------------------------------------------------------------------------------------------------
          100000 |    0.1 % |   69.66 μs |  676.26 ns | 2565.06 |    1.70 μs |   16.51 ns |   62.64 | 40.95 x
          100000 |    1.0 % |   75.86 μs |   75.34 ns |  285.75 |    2.73 μs |    2.71 ns |   10.28 | 27.79 x
          100000 |    5.0 % |   97.19 μs |   19.57 ns |   74.22 |    8.37 μs |    1.68 ns |    6.39 | 11.61 x
          100000 |   20.1 % |  167.39 μs |    8.32 ns |   31.57 |   25.38 μs |    1.26 ns |    4.79 |  6.60 x
          100000 |   50.0 % |  321.28 μs |    6.42 ns |   24.36 |   45.90 μs |    0.92 ns |    3.48 |  7.00 x
          100000 |   80.0 % |  179.27 μs |    2.24 ns |    8.50 |   74.24 μs |    0.93 ns |    3.52 |  2.41 x
          100000 |   99.0 % |   87.47 μs |    0.88 ns |    3.35 |   89.09 μs |    0.90 ns |    3.41 |  0.98 x
          100000 |  100.0 % |   82.88 μs |    0.83 ns |    3.14 |   57.23 μs |    0.57 ns |    2.17 |  1.45 x
       191 x 211 |    0.1 % |   38.61 μs |  877.43 ns | 3328.10 |    0.96 μs |   21.82 ns |   82.75 | 40.22 x
       191 x 211 |    1.0 % |   41.04 μs |  100.09 ns |  379.63 |    1.27 μs |    3.09 ns |   11.74 | 32.34 x
       191 x 211 |    5.1 % |   50.49 μs |   24.54 ns |   93.09 |    4.24 μs |    2.06 ns |    7.82 | 11.90 x
       191 x 211 |   20.2 % |   84.77 μs |   10.41 ns |   39.50 |   13.77 μs |    1.69 ns |    6.42 |  6.16 x
       191 x 211 |   50.1 % |  134.72 μs |    6.68 ns |   25.32 |   25.65 μs |    1.27 ns |    4.82 |  5.25 x
       191 x 211 |   80.0 % |   94.76 μs |    2.94 ns |   11.14 |   54.27 μs |    1.68 ns |    6.38 |  1.75 x
       191 x 211 |   99.0 % |   59.40 μs |    1.49 ns |    5.65 |   60.75 μs |    1.52 ns |    5.78 |  0.98 x
       191 x 211 |  100.0 % |   62.37 μs |    1.55 ns |    5.87 |   58.86 μs |    1.46 ns |    5.54 |  1.06 x
   15 x 201 x 10 |    0.1 % |   24.03 μs |  775.10 ns | 2939.94 |    1.32 μs |   42.67 ns |  161.86 | 18.16 x
   15 x 201 x 10 |    1.0 % |   27.00 μs |   87.65 ns |  332.47 |    2.37 μs |    7.69 ns |   29.18 | 11.39 x
   15 x 201 x 10 |    5.1 % |   37.26 μs |   24.35 ns |   92.36 |    5.17 μs |    3.38 ns |   12.83 |  7.20 x
   15 x 201 x 10 |   20.2 % |   65.06 μs |   10.69 ns |   40.56 |   19.44 μs |    3.19 ns |   12.12 |  3.35 x
   15 x 201 x 10 |   50.0 % |   99.89 μs |    6.63 ns |   25.14 |   32.13 μs |    2.13 ns |    8.09 |  3.11 x
   15 x 201 x 10 |   80.1 % |   74.51 μs |    3.08 ns |   11.70 |   58.59 μs |    2.42 ns |    9.20 |  1.27 x
   15 x 201 x 10 |   99.0 % |   41.58 μs |    1.39 ns |    5.28 |   48.87 μs |    1.64 ns |    6.21 |  0.85 x
   15 x 201 x 10 |  100.0 % |   62.37 μs |    2.07 ns |    7.85 |   60.20 μs |    2.00 ns |    7.57 |  1.04 x
 64 x 9 x 3 x 18 |    0.1 % |   22.41 μs |  659.06 ns | 2499.81 |    0.92 μs |   27.19 ns |  103.13 | 24.24 x
 64 x 9 x 3 x 18 |    1.0 % |   24.57 μs |   76.54 ns |  290.30 |    1.43 μs |    4.46 ns |   16.91 | 17.17 x
 64 x 9 x 3 x 18 |    5.1 % |   31.86 μs |   20.15 ns |   76.43 |    5.89 μs |    3.72 ns |   14.12 |  5.41 x
 64 x 9 x 3 x 18 |   20.1 % |   57.77 μs |    9.23 ns |   35.00 |   15.39 μs |    2.46 ns |    9.32 |  3.75 x
 64 x 9 x 3 x 18 |   50.0 % |   96.11 μs |    6.18 ns |   23.45 |   42.39 μs |    2.73 ns |   10.34 |  2.27 x
 64 x 9 x 3 x 18 |   80.2 % |   70.73 μs |    2.84 ns |   10.76 |   67.77 μs |    2.72 ns |   10.31 |  1.04 x
 64 x 9 x 3 x 18 |   99.0 % |   79.37 μs |    2.58 ns |    9.78 |   83.69 μs |    2.72 ns |   10.31 |  0.95 x
 64 x 9 x 3 x 18 |  100.0 % |   84.23 μs |    2.71 ns |   10.27 |   79.64 μs |    2.56 ns |    9.71 |  1.06 x
2 Likes

With Julia 1.0.1

Power 9:

CPU speed: 3.80 GHz

       size      | selected |  old time  |   per idx  |  cycles |  new time  |   per idx  |  cycles | speedup
-------------------------------------------------------------------------------------------------------------
          100000 |    0.1 % |   90.20 μs |  875.71 ns | 3327.69 |    1.60 μs |   15.52 ns |   58.97 | 56.43 x
          100000 |    1.0 % |   93.51 μs |   92.86 ns |  352.85 |    4.01 μs |    3.98 ns |   15.12 | 23.34 x
          100000 |    5.0 % |  117.19 μs |   23.59 ns |   89.65 |   16.65 μs |    3.35 ns |   12.74 |  7.04 x
          100000 |   20.1 % |  242.06 μs |   12.04 ns |   45.74 |   39.66 μs |    1.97 ns |    7.49 |  6.10 x
          100000 |   50.0 % |  389.76 μs |    7.79 ns |   29.60 |   79.43 μs |    1.59 ns |    6.03 |  4.91 x
          100000 |   80.0 % |  258.46 μs |    3.23 ns |   12.28 |  118.43 μs |    1.48 ns |    5.63 |  2.18 x
          100000 |   99.0 % |  120.15 μs |    1.21 ns |    4.61 |  142.52 μs |    1.44 ns |    5.47 |  0.84 x
          100000 |  100.0 % |  111.71 μs |    1.12 ns |    4.25 |   23.93 μs |    0.24 ns |    0.91 |  4.67 x
       191 x 211 |    0.1 % |   53.86 μs | 1224.09 ns | 4651.55 |    0.97 μs |   22.03 ns |   83.71 | 55.57 x
       191 x 211 |    1.0 % |   56.08 μs |  136.79 ns |  519.78 |    2.38 μs |    5.81 ns |   22.09 | 23.53 x
       191 x 211 |    5.1 % |   64.90 μs |   31.55 ns |  119.89 |    7.06 μs |    3.43 ns |   13.04 |  9.19 x
       191 x 211 |   20.2 % |  110.15 μs |   13.53 ns |   51.42 |   19.23 μs |    2.36 ns |    8.97 |  5.73 x
       191 x 211 |   50.1 % |  171.85 μs |    8.51 ns |   32.36 |   37.94 μs |    1.88 ns |    7.14 |  4.53 x
       191 x 211 |   80.0 % |  120.64 μs |    3.74 ns |   14.21 |   53.82 μs |    1.67 ns |    6.34 |  2.24 x
       191 x 211 |   99.0 % |   66.29 μs |    1.66 ns |    6.32 |   64.17 μs |    1.61 ns |    6.11 |  1.03 x
       191 x 211 |  100.0 % |   63.76 μs |    1.58 ns |    6.01 |   32.86 μs |    0.82 ns |    3.10 |  1.94 x
   15 x 201 x 10 |    0.1 % |   58.11 μs | 1874.55 ns | 7123.28 |    1.89 μs |   60.90 ns |  231.42 | 30.78 x
   15 x 201 x 10 |    1.0 % |   59.79 μs |  194.11 ns |  737.63 |    3.63 μs |   11.79 ns |   44.79 | 16.47 x
   15 x 201 x 10 |    5.1 % |   64.61 μs |   42.23 ns |  160.46 |    8.98 μs |    5.87 ns |   22.29 |  7.20 x
   15 x 201 x 10 |   20.2 % |   97.54 μs |   16.03 ns |   60.92 |   29.23 μs |    4.80 ns |   18.26 |  3.34 x
   15 x 201 x 10 |   50.0 % |  145.18 μs |    9.63 ns |   36.61 |   44.01 μs |    2.92 ns |   11.10 |  3.30 x
   15 x 201 x 10 |   80.1 % |  105.31 μs |    4.36 ns |   16.56 |   57.29 μs |    2.37 ns |    9.01 |  1.84 x
   15 x 201 x 10 |   99.0 % |   67.25 μs |    2.25 ns |    8.56 |   51.88 μs |    1.74 ns |    6.61 |  1.30 x
   15 x 201 x 10 |  100.0 % |   65.55 μs |    2.17 ns |    8.26 |   59.41 μs |    1.97 ns |    7.49 |  1.10 x
 64 x 9 x 3 x 18 |    0.1 % |   75.48 μs | 2220.06 ns | 8436.22 |    0.98 μs |   28.85 ns |  109.64 | 76.94 x
 64 x 9 x 3 x 18 |    1.0 % |   77.09 μs |  240.15 ns |  912.58 |    2.55 μs |    7.95 ns |   30.20 | 30.22 x
 64 x 9 x 3 x 18 |    5.1 % |   82.81 μs |   52.38 ns |  199.04 |    6.76 μs |    4.28 ns |   16.25 | 12.25 x
 64 x 9 x 3 x 18 |   20.1 % |  117.46 μs |   18.76 ns |   71.28 |   16.02 μs |    2.56 ns |    9.72 |  7.33 x
 64 x 9 x 3 x 18 |   50.0 % |  168.92 μs |   10.87 ns |   41.30 |   31.86 μs |    2.05 ns |    7.79 |  5.30 x
 64 x 9 x 3 x 18 |   80.2 % |  127.72 μs |    5.12 ns |   19.46 |   47.94 μs |    1.92 ns |    7.30 |  2.66 x
 64 x 9 x 3 x 18 |   99.0 % |   85.14 μs |    2.77 ns |   10.51 |   58.28 μs |    1.89 ns |    7.19 |  1.46 x
 64 x 9 x 3 x 18 |  100.0 % |   83.46 μs |    2.68 ns |   10.20 |   75.90 μs |    2.44 ns |    9.27 |  1.10 x

Intel 4790k

CPU speed: 4.00 GHz

       size      | selected |  old time  |   per idx  |  cycles |  new time  |   per idx  |  cycles | speedup
-------------------------------------------------------------------------------------------------------------
          100000 |    0.1 % |   53.28 μs |  517.30 ns | 2069.20 |    1.11 μs |   10.75 ns |   42.99 | 48.14 x
          100000 |    1.0 % |   58.72 μs |   58.31 ns |  233.25 |    1.80 μs |    1.78 ns |    7.13 | 32.71 x
          100000 |    5.0 % |   77.91 μs |   15.69 ns |   62.74 |    8.02 μs |    1.61 ns |    6.46 |  9.72 x
          100000 |   20.1 % |  159.65 μs |    7.94 ns |   31.75 |   17.54 μs |    0.87 ns |    3.49 |  9.10 x
          100000 |   50.0 % |  251.55 μs |    5.03 ns |   20.11 |   37.70 μs |    0.75 ns |    3.01 |  6.67 x
          100000 |   80.0 % |  157.47 μs |    1.97 ns |    7.87 |   64.12 μs |    0.80 ns |    3.21 |  2.46 x
          100000 |   99.0 % |   77.91 μs |    0.79 ns |    3.15 |   81.38 μs |    0.82 ns |    3.29 |  0.96 x
          100000 |  100.0 % |   73.14 μs |    0.73 ns |    2.93 |   60.45 μs |    0.60 ns |    2.42 |  1.21 x
       191 x 211 |    0.1 % |   31.72 μs |  720.86 ns | 2883.45 |    0.55 μs |   12.50 ns |   50.01 | 57.66 x
       191 x 211 |    1.0 % |   33.34 μs |   81.31 ns |  325.24 |    0.94 μs |    2.30 ns |    9.21 | 35.32 x
       191 x 211 |    5.1 % |   40.29 μs |   19.59 ns |   78.35 |    2.88 μs |    1.40 ns |    5.60 | 13.98 x
       191 x 211 |   20.2 % |   69.88 μs |    8.58 ns |   34.33 |   10.51 μs |    1.29 ns |    5.16 |  6.65 x
       191 x 211 |   50.1 % |  117.26 μs |    5.81 ns |   23.24 |   24.98 μs |    1.24 ns |    4.95 |  4.70 x
       191 x 211 |   80.0 % |   77.26 μs |    2.40 ns |    9.58 |   34.13 μs |    1.06 ns |    4.23 |  2.26 x
       191 x 211 |   99.0 % |   51.08 μs |    1.28 ns |    5.12 |   48.09 μs |    1.21 ns |    4.82 |  1.06 x
       191 x 211 |  100.0 % |   49.82 μs |    1.24 ns |    4.94 |   48.78 μs |    1.21 ns |    4.84 |  1.02 x
   15 x 201 x 10 |    0.1 % |   20.47 μs |  660.26 ns | 2641.03 |    0.88 μs |   28.43 ns |  113.70 | 23.23 x
   15 x 201 x 10 |    1.0 % |   21.62 μs |   70.21 ns |  280.83 |    1.41 μs |    4.56 ns |   18.25 | 15.39 x
   15 x 201 x 10 |    5.1 % |   28.34 μs |   18.52 ns |   74.10 |    3.07 μs |    2.00 ns |    8.02 |  9.24 x
   15 x 201 x 10 |   20.2 % |   59.96 μs |    9.86 ns |   39.42 |   13.38 μs |    2.20 ns |    8.79 |  4.48 x
   15 x 201 x 10 |   50.0 % |   91.14 μs |    6.05 ns |   24.19 |   27.01 μs |    1.79 ns |    7.17 |  3.37 x
   15 x 201 x 10 |   80.1 % |   62.19 μs |    2.57 ns |   10.30 |   42.55 μs |    1.76 ns |    7.04 |  1.46 x
   15 x 201 x 10 |   99.0 % |   52.18 μs |    1.75 ns |    6.99 |   52.20 μs |    1.75 ns |    7.00 |  1.00 x
   15 x 201 x 10 |  100.0 % |   53.80 μs |    1.78 ns |    7.14 |   49.31 μs |    1.64 ns |    6.54 |  1.09 x
 64 x 9 x 3 x 18 |    0.1 % |   18.82 μs |  553.41 ns | 2213.65 |    0.53 μs |   15.68 ns |   62.73 | 35.29 x
 64 x 9 x 3 x 18 |    1.0 % |   21.13 μs |   65.82 ns |  263.28 |    1.12 μs |    3.49 ns |   13.97 | 18.85 x
 64 x 9 x 3 x 18 |    5.1 % |   28.70 μs |   18.16 ns |   72.62 |    4.37 μs |    2.76 ns |   11.05 |  6.57 x
 64 x 9 x 3 x 18 |   20.1 % |   52.25 μs |    8.34 ns |   33.37 |   14.34 μs |    2.29 ns |    9.16 |  3.64 x
 64 x 9 x 3 x 18 |   50.0 % |   85.26 μs |    5.49 ns |   21.94 |   36.34 μs |    2.34 ns |    9.35 |  2.35 x
 64 x 9 x 3 x 18 |   80.2 % |   64.84 μs |    2.60 ns |   10.40 |   58.75 μs |    2.36 ns |    9.42 |  1.10 x
 64 x 9 x 3 x 18 |   99.0 % |   74.27 μs |    2.41 ns |    9.65 |   70.42 μs |    2.29 ns |    9.15 |  1.05 x
 64 x 9 x 3 x 18 |  100.0 % |   74.51 μs |    2.40 ns |    9.58 |   75.90 μs |    2.44 ns |    9.76 |  0.98 x

A few observations

  • The performance for the 100% selected case was substantially better than the 99% case for the one-dimensional tests. This effect appears to go away for the higher-dimensional tests.
  • For the 100% selected case in one and two dimensions, the Power9 system performs better than the Intel system, and performs similarly for the three- and four-dimensional cases. For anything less than 100% selected, the Power9 system performs worse (in the neighborhood of 2x).
3 Likes

Thanks!

Looks like the new code is also good on Ryzen for realistic fill rates. The 5 cycles for the inner loop on Power9 don’t look so nice, but that is still better than the old code and not really bad.

This is explained by a completely different code path being used for the 100% selected case (just returning all indices), and also different code paths for 1, 2 and 3+ dimensions. But indeed, that’s a much bigger difference than we’ve seen on other systems!

I can’t say I fully understand this thread, but it makes me worry that all my benchmarks are bunkum. Which benchmarks can I trust? Are there guidelines on how an average user should benchmark their code to avoid the branch history problem? If the timings are not accurate for many benchmarks, are they at least accurate in a relative sense, so I can tell that foo() is faster than bar()?

It’s the equivalent of "astronomers find that the measurement of the Sun’s weight was off by 0.001%!" It’s amazing that people in the Julia community have such a commitment to accuracy, but unless you’re benchmarking really fast operations (below 100 ns?) or are involved in helioseismology, it really shouldn’t matter.

4 Likes

The moral of this thread is that the results of microbenchmarks should always be treated as a rough guideline. This thread discusses one possible and particularly opaque way that microbenchmarks can misrepresent the performance impact of a change in your application, but there are many others.

That’s an ideal frame of mind to be in when benchmarking :slight_smile: A quick prototype and microbenchmark can be a great guide for development, but the ultimate test is always to integrate foo() into your real application as a replacement for bar() and benchmark it in situ. Sometimes the change you thought would make so much difference turns out to have no effect on the larger application ¯_(ツ)_/¯

1 Like

The thing you need to remember: @btime foo() expands to a loop that calls foo() many times in a row, roughly for i = 1:N; foo(); end.

Is this indicative of your true workload? If not, then how does it differ?

Is the difference substantial? “Substantial or not” depends on CPU architecture.

It is very well known that you need to account for warm data caches (L1d, L2d, L3d), but you probably also want to simulate a warm instruction cache. It is pretty well known that measuring latency is fiendishly difficult (your CPU is superscalar, i.e. it can execute multiple foo() calls in “hidden parallel” unless you take pains to introduce data dependencies the CPU cannot speculate around). It is well known that the CPU remembers short branch histories; i.e. if you want to simulate foo(arg) with unpredictable / random data, it is unhelpful to call foo multiple times with the same input data; instead you should run it on multiple different inputs.
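
As a sketch of that last point (hypothetical foo, nothing from the thread): compare one fixed input against a pool of different inputs.

using BenchmarkTools

# Hypothetical branchy function.
foo(x) = count(v -> v > 0.5, x)

# Same input every evaluation: the branch history repeats exactly.
x = rand(10_000)
@btime foo($x)

# A pool of different inputs: each call sees a different pattern.
# (Divide the reported time by length(pool) for a per-call estimate.)
pool = [rand(10_000) for _ in 1:64]
@btime foreach(foo, $pool)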

It is less well known, even though obvious in retrospect, that pretty long random periodic patterns can be remembered by the CPU. This is a semi-obvious effect of pattern matching in a branch history table, a la sequence alignment in genomes.

The other effects (warm data caches, latency vs. throughput, don’t repeat the same function call on the same short data) are much more important than sneaky branch-history pattern alignment, but this specific effect is so obscure that it deserves specific mention (and it gave up to 4x speed differences in benchmarks that were crucial for deciding to switch out the implementation of some functionality in Julia Base).

[I am saying semi-obvious, even though I must admit that I didn’t think of it until encountering it in practice. An astute reader (which I am apparently not) would have guessed this effect upon learning of the BHT.]

3 Likes