PSA: Microbenchmarks remember branch history

benchmark

#21

Yes, I’m aware that this is bad for benchmarking 🙂 All I’m saying is that, at the same time, this can be a good thing for real data, where patterns between runs often do exist.

Apologies if I’ve expressed that poorly in my previous post.


#22

Now I’m wondering if Ryzen’s branch-predicting NN is actually learning patterns in Julia’s pseudo-RNG…


#23

I’m going to go out on a limb here and say this result is only really bad if you don’t know whether there’s a pattern in (or have no metadata about) the data you feed into the function you want to benchmark — e.g. when developing core functions with unknown usage, like very general-purpose functions (matrix multiplication, findall and the like). If you do know there’s a pattern, or have some metadata about what data to expect, it would be best to replicate those patterns in the benchmark data to get an accurate reading, right? Of course, one should be careful to avoid over-optimizing a function for such patterns, and it should be disclosed when such a specialized benchmark is run and presented.


#24

Is there a practical recommendation about how I should benchmark in Julia to mitigate this effect?

I don’t know. Increasing your dataset to contain 5000+ points is probably good. Otherwise… we’ll need to see. I guess this thread is a good place to have this discussion. If https://github.com/JuliaCI/BenchmarkTools.jl/pull/92 gets merged, we should get the timing precision good enough to run with evals/sample = 1; then the proper way will be to provide a setup=... parameter that re-randomizes your data.
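A minimal sketch of what that could look like with BenchmarkTools, once single-evaluation timing is precise enough (the workload `branchy` and `data` are made up for illustration; `setup=...` and `evals=1` are the knobs being discussed):

```julia
using BenchmarkTools

# Hypothetical branchy workload; any data-dependent branch will do.
branchy(v) = sum(x -> x < 0.5 ? x : 2x, v)

data() = rand(10_000)   # fresh random input

# `setup` re-randomizes the input before every sample, and `evals=1`
# times a single call per sample, so the predictor never replays the
# same branch history. This only gives meaningful numbers once timer
# precision is good enough for single-evaluation samples.
b = @benchmark branchy(v) setup=(v = data()) evals=1
```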

As a second example: if your benchmarked function involves a binary search (e.g. searchsortedfirst), then you have the same problem — the branching pattern encodes the position of your key. In theory, small sorts should have the same problem. And even if you were clever and benchmarked a broadcast of the binary search over a small vector, you can still be caught out, because the vector is too small and nobody in their right mind expects the branch predictor to remember insanely long patterns. In theory, branch prediction and cache misses should combine into really strong effects, because transient execution — the work speculated while waiting for the branch to resolve, which can take very long if a cache miss is involved — can request and evict cache lines (leading to Spectre hilarity). This should manifest as a strong amplification of the timing effect of branch misses, but I could not reproduce it.
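To make the binary-search case concrete, here is a small sketch (vector size and key count are arbitrary): with a fixed key set, every timing pass replays the identical branch history, while drawing fresh keys per pass breaks the learned pattern.

```julia
# The comparison branches inside searchsortedfirst trace out the path
# to each key, so repeating the same keys repeats the branch history.
v = collect(1:1024)

lookup(keys) = sum(searchsortedfirst(v, k) for k in keys)

fixed_keys = rand(1:1024, 256)    # same branch pattern every pass
fresh_keys() = rand(1:1024, 256)  # re-randomize to defeat the predictor

# Timing lookup(fixed_keys) in a loop measures the *trained* predictor;
# timing lookup(fresh_keys()) per pass measures the realistic case.
```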

I guess we will see what the fallout is. I think we’ll start by reviewing the BaseBenchmarks and then hope that it will become clearer what to do / recommend in BenchmarkTools.


#25

I’d guess there’s at least one real-world example where learning long-ish patterns can be useful: for loops. Yes, the data you’re processing presumably differs from run to run, but loop termination itself is a branch whose pattern repeats if the loop length does. I have a hard time believing one would see much benefit from being able to learn a single loop of length 2000, but if you have a pattern where you run a loop of length 2, then 3, then 4, then 5, then… (think of, e.g., Cholesky factorization), the overall period can grow to be longish while each individual loop stays short. If you factor a whole bunch of 15x15 matrices over and over, it seems reasonable to me that accurate branch prediction could have a major positive benefit.
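As a rough sketch of that regime (sizes chosen arbitrarily; the diagonal shift is just to guarantee positive definiteness): factoring many 15x15 matrices repeats inner loops of lengths 1 through 15, a short overall branch-history period the predictor can plausibly learn even though the matrix data is fresh each time.

```julia
using LinearAlgebra

# Each cholesky of an n x n matrix runs inner loops of lengths 1..n,
# so across `reps` factorizations the loop-termination branches repeat
# with a short period, while the *data* changes every iteration.
function factor_many(n, reps)
    s = 0.0
    for _ in 1:reps
        A = Symmetric(rand(n, n)) + n * I   # diagonally dominant => SPD
        s += sum(cholesky(A).L)
    end
    return s
end
```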

I’m not arguing that this isn’t misleading for many of the benchmarks (it is); I just want to point out that overall it’s a good thing CPUs do this.


#26

I think there are tons of real-world examples where learning long patterns pays off. For one thing, very few people write perfect code that never repeats work.


#27

I saw a similar pattern:

AMD Ryzen Threadripper 1950X 16-Core Processor; branch-miss penalty:  6.2 ns = 22.5 cycles

Period     2:  361.70 us =    2.72 cycles per idx. Miss-rate  0.00%
Period   100:  477.46 us =    3.60 cycles per idx. Miss-rate  3.87%
Period   500:  599.56 us =    4.52 cycles per idx. Miss-rate  7.95%
Period  1000:  661.47 us =    4.98 cycles per idx. Miss-rate 10.02%
Period  2000:  726.98 us =    5.48 cycles per idx. Miss-rate 12.20%
Period  2500:  782.91 us =    5.90 cycles per idx. Miss-rate 14.07%
Period  3000:  916.26 us =    6.90 cycles per idx. Miss-rate 18.53%
Period  5000: 1000.20 us =    7.53 cycles per idx. Miss-rate 21.33%
Period 10000: 1245.99 us =    9.39 cycles per idx. Miss-rate 29.55%
Period 15000: 1309.33 us =    9.86 cycles per idx. Miss-rate 31.66%
Period 20000: 1391.70 us =   10.48 cycles per idx. Miss-rate 34.41%
Period 30000: 1497.16 us =   11.28 cycles per idx. Miss-rate 37.94%
Period 40000: 1587.96 us =   11.96 cycles per idx. Miss-rate 40.97%
Period 60000: 1655.30 us =   12.47 cycles per idx. Miss-rate 43.22%
Period 80000: 1739.87 us =   13.11 cycles per idx. Miss-rate 46.05%
Period 120000: 1847.58 us =   13.92 cycles per idx. Miss-rate 49.65%
Period 240000: 1895.98 us =   14.28 cycles per idx. Miss-rate 51.26%
Period 480000: 1860.58 us =   14.02 cycles per idx. Miss-rate 50.08%


AMD Ryzen Threadripper 1950X 16-Core Processor; branch-miss penalty:  6.2 ns = 22.5 cycles

Period     2:  359.88 us =    2.71 cycles per idx. Miss-rate  0.00%
Period   100:  466.77 us =    3.52 cycles per idx. Miss-rate  3.58%
Period   500:  564.70 us =    4.25 cycles per idx. Miss-rate  6.85%
Period  1000:  639.48 us =    4.82 cycles per idx. Miss-rate  9.35%
Period  2000:  731.45 us =    5.51 cycles per idx. Miss-rate 12.43%
Period  2500:  849.87 us =    6.40 cycles per idx. Miss-rate 16.39%
Period  3000:  876.82 us =    6.61 cycles per idx. Miss-rate 17.29%
Period  5000: 1005.03 us =    7.57 cycles per idx. Miss-rate 21.58%
Period 10000: 1232.19 us =    9.28 cycles per idx. Miss-rate 29.18%
Period 15000: 1310.84 us =    9.87 cycles per idx. Miss-rate 31.81%
Period 20000: 1448.19 us =   10.91 cycles per idx. Miss-rate 36.40%
Period 30000: 1493.49 us =   11.25 cycles per idx. Miss-rate 37.92%
Period 40000: 1634.33 us =   12.31 cycles per idx. Miss-rate 42.63%
Period 60000: 1657.22 us =   12.48 cycles per idx. Miss-rate 43.40%
Period 80000: 1693.99 us =   12.76 cycles per idx. Miss-rate 44.63%
Period 120000: 1754.76 us =   13.22 cycles per idx. Miss-rate 46.66%
Period 240000: 1820.66 us =   13.72 cycles per idx. Miss-rate 48.86%
Period 480000: 1825.42 us =   13.75 cycles per idx. Miss-rate 49.02%

#28

FWIW, I see equivalent behavior with the newer revision of Ryzen as well (second version, N=480_000):

AMD Ryzen 5 2600X Six-Core Processor; branch-miss penalty:  6.4 ns = 13.4 cycles

Period     2:  348.67 us =    1.53 cycles per idx. Miss-rate  0.00%
Period   100:  473.76 us =    2.08 cycles per idx. Miss-rate  4.10%
Period   500:  575.19 us =    2.53 cycles per idx. Miss-rate  7.42%
Period  1000:  671.69 us =    2.95 cycles per idx. Miss-rate 10.58%
Period  2000:  839.49 us =    3.69 cycles per idx. Miss-rate 16.08%
Period  2500:  812.51 us =    3.57 cycles per idx. Miss-rate 15.20%
Period  3000:  856.35 us =    3.76 cycles per idx. Miss-rate 16.63%
Period  5000:  987.01 us =    4.34 cycles per idx. Miss-rate 20.92%
Period 10000: 1206.00 us =    5.30 cycles per idx. Miss-rate 28.09%
Period 15000: 1380.67 us =    6.07 cycles per idx. Miss-rate 33.81%
Period 20000: 1443.07 us =    6.34 cycles per idx. Miss-rate 35.86%
Period 30000: 1511.92 us =    6.65 cycles per idx. Miss-rate 38.11%
Period 40000: 1568.28 us =    6.89 cycles per idx. Miss-rate 39.96%
Period 60000: 1741.56 us =    7.66 cycles per idx. Miss-rate 45.64%

#29

Btw, slightly off-topic: Could some Ryzen / Power / ARM owner post the output of https://github.com/maxbennedich/julia-scripts/blob/master/findall_benchmark.jl on 1.0.0, 1.0.1, or 1.0.2? Master has the relevant PR already merged.

The reason is that the new BitArray logical indexing / findall code is optimized pretty hard for Intel timings of bit-manipulation instructions (Intel: fast BLSR, slow TZCNT / LZCNT; AMD: medium BLSR / TZCNT, fast LZCNT). Nobody on GitHub ran tests on a non-Intel CPU, so I am kinda curious how bad it is.

If I am reading the instruction tables right, this should still be OK. But if the new code happens to perform really badly on Ryzen/POWER/ARM, then maybe someone should think about possible alternatives (e.g.: use AMD’s fast LZCNT to get fast reverse-order iteration and fill the output from the other end).
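For reference, the kernel in question boils down to the classic set-bit iteration idiom below (a sketch, not the actual Base code): `trailing_zeros` compiles to TZCNT and `x & (x - 1)` to BLSR, and the reverse-order variant mentioned above would use `leading_zeros` (LZCNT) instead, filling the output from the back.

```julia
# Forward iteration over set bits: TZCNT finds the next index,
# BLSR clears it. This is the instruction mix whose Intel-vs-AMD
# timings the new findall code was tuned against.
function setbits(x::UInt64)
    out = Int[]
    while x != 0
        push!(out, trailing_zeros(x) + 1)  # 1-based bit position
        x &= x - 1                         # clear lowest set bit (BLSR)
    end
    return out
end
```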

The output will be polluted by this issue of remembered branch history, but at least timings of the new non-branchy findall(V::BitVector) should be good enough to see whether there is a problem.


#30

This is on a Ryzen 5 1600:

findall_benchmark.jl
julia> include("findall.jl")
CPU speed: 3.79 GHz

       size      | selected |  old time  |   per idx  |  cycles |  new time  |   per idx  |  cycles | speedup
-------------------------------------------------------------------------------------------------------------
          100000 |    0.1 % |   69.66 μs |  676.26 ns | 2565.06 |    1.70 μs |   16.51 ns |   62.64 | 40.95 x
          100000 |    1.0 % |   75.86 μs |   75.34 ns |  285.75 |    2.73 μs |    2.71 ns |   10.28 | 27.79 x
          100000 |    5.0 % |   97.19 μs |   19.57 ns |   74.22 |    8.37 μs |    1.68 ns |    6.39 | 11.61 x
          100000 |   20.1 % |  167.39 μs |    8.32 ns |   31.57 |   25.38 μs |    1.26 ns |    4.79 |  6.60 x
          100000 |   50.0 % |  321.28 μs |    6.42 ns |   24.36 |   45.90 μs |    0.92 ns |    3.48 |  7.00 x
          100000 |   80.0 % |  179.27 μs |    2.24 ns |    8.50 |   74.24 μs |    0.93 ns |    3.52 |  2.41 x
          100000 |   99.0 % |   87.47 μs |    0.88 ns |    3.35 |   89.09 μs |    0.90 ns |    3.41 |  0.98 x
          100000 |  100.0 % |   82.88 μs |    0.83 ns |    3.14 |   57.23 μs |    0.57 ns |    2.17 |  1.45 x
       191 x 211 |    0.1 % |   38.61 μs |  877.43 ns | 3328.10 |    0.96 μs |   21.82 ns |   82.75 | 40.22 x
       191 x 211 |    1.0 % |   41.04 μs |  100.09 ns |  379.63 |    1.27 μs |    3.09 ns |   11.74 | 32.34 x
       191 x 211 |    5.1 % |   50.49 μs |   24.54 ns |   93.09 |    4.24 μs |    2.06 ns |    7.82 | 11.90 x
       191 x 211 |   20.2 % |   84.77 μs |   10.41 ns |   39.50 |   13.77 μs |    1.69 ns |    6.42 |  6.16 x
       191 x 211 |   50.1 % |  134.72 μs |    6.68 ns |   25.32 |   25.65 μs |    1.27 ns |    4.82 |  5.25 x
       191 x 211 |   80.0 % |   94.76 μs |    2.94 ns |   11.14 |   54.27 μs |    1.68 ns |    6.38 |  1.75 x
       191 x 211 |   99.0 % |   59.40 μs |    1.49 ns |    5.65 |   60.75 μs |    1.52 ns |    5.78 |  0.98 x
       191 x 211 |  100.0 % |   62.37 μs |    1.55 ns |    5.87 |   58.86 μs |    1.46 ns |    5.54 |  1.06 x
   15 x 201 x 10 |    0.1 % |   24.03 μs |  775.10 ns | 2939.94 |    1.32 μs |   42.67 ns |  161.86 | 18.16 x
   15 x 201 x 10 |    1.0 % |   27.00 μs |   87.65 ns |  332.47 |    2.37 μs |    7.69 ns |   29.18 | 11.39 x
   15 x 201 x 10 |    5.1 % |   37.26 μs |   24.35 ns |   92.36 |    5.17 μs |    3.38 ns |   12.83 |  7.20 x
   15 x 201 x 10 |   20.2 % |   65.06 μs |   10.69 ns |   40.56 |   19.44 μs |    3.19 ns |   12.12 |  3.35 x
   15 x 201 x 10 |   50.0 % |   99.89 μs |    6.63 ns |   25.14 |   32.13 μs |    2.13 ns |    8.09 |  3.11 x
   15 x 201 x 10 |   80.1 % |   74.51 μs |    3.08 ns |   11.70 |   58.59 μs |    2.42 ns |    9.20 |  1.27 x
   15 x 201 x 10 |   99.0 % |   41.58 μs |    1.39 ns |    5.28 |   48.87 μs |    1.64 ns |    6.21 |  0.85 x
   15 x 201 x 10 |  100.0 % |   62.37 μs |    2.07 ns |    7.85 |   60.20 μs |    2.00 ns |    7.57 |  1.04 x
 64 x 9 x 3 x 18 |    0.1 % |   22.41 μs |  659.06 ns | 2499.81 |    0.92 μs |   27.19 ns |  103.13 | 24.24 x
 64 x 9 x 3 x 18 |    1.0 % |   24.57 μs |   76.54 ns |  290.30 |    1.43 μs |    4.46 ns |   16.91 | 17.17 x
 64 x 9 x 3 x 18 |    5.1 % |   31.86 μs |   20.15 ns |   76.43 |    5.89 μs |    3.72 ns |   14.12 |  5.41 x
 64 x 9 x 3 x 18 |   20.1 % |   57.77 μs |    9.23 ns |   35.00 |   15.39 μs |    2.46 ns |    9.32 |  3.75 x
 64 x 9 x 3 x 18 |   50.0 % |   96.11 μs |    6.18 ns |   23.45 |   42.39 μs |    2.73 ns |   10.34 |  2.27 x
 64 x 9 x 3 x 18 |   80.2 % |   70.73 μs |    2.84 ns |   10.76 |   67.77 μs |    2.72 ns |   10.31 |  1.04 x
 64 x 9 x 3 x 18 |   99.0 % |   79.37 μs |    2.58 ns |    9.78 |   83.69 μs |    2.72 ns |   10.31 |  0.95 x
 64 x 9 x 3 x 18 |  100.0 % |   84.23 μs |    2.71 ns |   10.27 |   79.64 μs |    2.56 ns |    9.71 |  1.06 x

#31

With Julia 1.0.1

Power 9:

CPU speed: 3.80 GHz

       size      | selected |  old time  |   per idx  |  cycles |  new time  |   per idx  |  cycles | speedup
-------------------------------------------------------------------------------------------------------------
          100000 |    0.1 % |   90.20 μs |  875.71 ns | 3327.69 |    1.60 μs |   15.52 ns |   58.97 | 56.43 x
          100000 |    1.0 % |   93.51 μs |   92.86 ns |  352.85 |    4.01 μs |    3.98 ns |   15.12 | 23.34 x
          100000 |    5.0 % |  117.19 μs |   23.59 ns |   89.65 |   16.65 μs |    3.35 ns |   12.74 |  7.04 x
          100000 |   20.1 % |  242.06 μs |   12.04 ns |   45.74 |   39.66 μs |    1.97 ns |    7.49 |  6.10 x
          100000 |   50.0 % |  389.76 μs |    7.79 ns |   29.60 |   79.43 μs |    1.59 ns |    6.03 |  4.91 x
          100000 |   80.0 % |  258.46 μs |    3.23 ns |   12.28 |  118.43 μs |    1.48 ns |    5.63 |  2.18 x
          100000 |   99.0 % |  120.15 μs |    1.21 ns |    4.61 |  142.52 μs |    1.44 ns |    5.47 |  0.84 x
          100000 |  100.0 % |  111.71 μs |    1.12 ns |    4.25 |   23.93 μs |    0.24 ns |    0.91 |  4.67 x
       191 x 211 |    0.1 % |   53.86 μs | 1224.09 ns | 4651.55 |    0.97 μs |   22.03 ns |   83.71 | 55.57 x
       191 x 211 |    1.0 % |   56.08 μs |  136.79 ns |  519.78 |    2.38 μs |    5.81 ns |   22.09 | 23.53 x
       191 x 211 |    5.1 % |   64.90 μs |   31.55 ns |  119.89 |    7.06 μs |    3.43 ns |   13.04 |  9.19 x
       191 x 211 |   20.2 % |  110.15 μs |   13.53 ns |   51.42 |   19.23 μs |    2.36 ns |    8.97 |  5.73 x
       191 x 211 |   50.1 % |  171.85 μs |    8.51 ns |   32.36 |   37.94 μs |    1.88 ns |    7.14 |  4.53 x
       191 x 211 |   80.0 % |  120.64 μs |    3.74 ns |   14.21 |   53.82 μs |    1.67 ns |    6.34 |  2.24 x
       191 x 211 |   99.0 % |   66.29 μs |    1.66 ns |    6.32 |   64.17 μs |    1.61 ns |    6.11 |  1.03 x
       191 x 211 |  100.0 % |   63.76 μs |    1.58 ns |    6.01 |   32.86 μs |    0.82 ns |    3.10 |  1.94 x
   15 x 201 x 10 |    0.1 % |   58.11 μs | 1874.55 ns | 7123.28 |    1.89 μs |   60.90 ns |  231.42 | 30.78 x
   15 x 201 x 10 |    1.0 % |   59.79 μs |  194.11 ns |  737.63 |    3.63 μs |   11.79 ns |   44.79 | 16.47 x
   15 x 201 x 10 |    5.1 % |   64.61 μs |   42.23 ns |  160.46 |    8.98 μs |    5.87 ns |   22.29 |  7.20 x
   15 x 201 x 10 |   20.2 % |   97.54 μs |   16.03 ns |   60.92 |   29.23 μs |    4.80 ns |   18.26 |  3.34 x
   15 x 201 x 10 |   50.0 % |  145.18 μs |    9.63 ns |   36.61 |   44.01 μs |    2.92 ns |   11.10 |  3.30 x
   15 x 201 x 10 |   80.1 % |  105.31 μs |    4.36 ns |   16.56 |   57.29 μs |    2.37 ns |    9.01 |  1.84 x
   15 x 201 x 10 |   99.0 % |   67.25 μs |    2.25 ns |    8.56 |   51.88 μs |    1.74 ns |    6.61 |  1.30 x
   15 x 201 x 10 |  100.0 % |   65.55 μs |    2.17 ns |    8.26 |   59.41 μs |    1.97 ns |    7.49 |  1.10 x
 64 x 9 x 3 x 18 |    0.1 % |   75.48 μs | 2220.06 ns | 8436.22 |    0.98 μs |   28.85 ns |  109.64 | 76.94 x
 64 x 9 x 3 x 18 |    1.0 % |   77.09 μs |  240.15 ns |  912.58 |    2.55 μs |    7.95 ns |   30.20 | 30.22 x
 64 x 9 x 3 x 18 |    5.1 % |   82.81 μs |   52.38 ns |  199.04 |    6.76 μs |    4.28 ns |   16.25 | 12.25 x
 64 x 9 x 3 x 18 |   20.1 % |  117.46 μs |   18.76 ns |   71.28 |   16.02 μs |    2.56 ns |    9.72 |  7.33 x
 64 x 9 x 3 x 18 |   50.0 % |  168.92 μs |   10.87 ns |   41.30 |   31.86 μs |    2.05 ns |    7.79 |  5.30 x
 64 x 9 x 3 x 18 |   80.2 % |  127.72 μs |    5.12 ns |   19.46 |   47.94 μs |    1.92 ns |    7.30 |  2.66 x
 64 x 9 x 3 x 18 |   99.0 % |   85.14 μs |    2.77 ns |   10.51 |   58.28 μs |    1.89 ns |    7.19 |  1.46 x
 64 x 9 x 3 x 18 |  100.0 % |   83.46 μs |    2.68 ns |   10.20 |   75.90 μs |    2.44 ns |    9.27 |  1.10 x

Intel 4790k

CPU speed: 4.00 GHz

       size      | selected |  old time  |   per idx  |  cycles |  new time  |   per idx  |  cycles | speedup
-------------------------------------------------------------------------------------------------------------
          100000 |    0.1 % |   53.28 μs |  517.30 ns | 2069.20 |    1.11 μs |   10.75 ns |   42.99 | 48.14 x
          100000 |    1.0 % |   58.72 μs |   58.31 ns |  233.25 |    1.80 μs |    1.78 ns |    7.13 | 32.71 x
          100000 |    5.0 % |   77.91 μs |   15.69 ns |   62.74 |    8.02 μs |    1.61 ns |    6.46 |  9.72 x
          100000 |   20.1 % |  159.65 μs |    7.94 ns |   31.75 |   17.54 μs |    0.87 ns |    3.49 |  9.10 x
          100000 |   50.0 % |  251.55 μs |    5.03 ns |   20.11 |   37.70 μs |    0.75 ns |    3.01 |  6.67 x
          100000 |   80.0 % |  157.47 μs |    1.97 ns |    7.87 |   64.12 μs |    0.80 ns |    3.21 |  2.46 x
          100000 |   99.0 % |   77.91 μs |    0.79 ns |    3.15 |   81.38 μs |    0.82 ns |    3.29 |  0.96 x
          100000 |  100.0 % |   73.14 μs |    0.73 ns |    2.93 |   60.45 μs |    0.60 ns |    2.42 |  1.21 x
       191 x 211 |    0.1 % |   31.72 μs |  720.86 ns | 2883.45 |    0.55 μs |   12.50 ns |   50.01 | 57.66 x
       191 x 211 |    1.0 % |   33.34 μs |   81.31 ns |  325.24 |    0.94 μs |    2.30 ns |    9.21 | 35.32 x
       191 x 211 |    5.1 % |   40.29 μs |   19.59 ns |   78.35 |    2.88 μs |    1.40 ns |    5.60 | 13.98 x
       191 x 211 |   20.2 % |   69.88 μs |    8.58 ns |   34.33 |   10.51 μs |    1.29 ns |    5.16 |  6.65 x
       191 x 211 |   50.1 % |  117.26 μs |    5.81 ns |   23.24 |   24.98 μs |    1.24 ns |    4.95 |  4.70 x
       191 x 211 |   80.0 % |   77.26 μs |    2.40 ns |    9.58 |   34.13 μs |    1.06 ns |    4.23 |  2.26 x
       191 x 211 |   99.0 % |   51.08 μs |    1.28 ns |    5.12 |   48.09 μs |    1.21 ns |    4.82 |  1.06 x
       191 x 211 |  100.0 % |   49.82 μs |    1.24 ns |    4.94 |   48.78 μs |    1.21 ns |    4.84 |  1.02 x
   15 x 201 x 10 |    0.1 % |   20.47 μs |  660.26 ns | 2641.03 |    0.88 μs |   28.43 ns |  113.70 | 23.23 x
   15 x 201 x 10 |    1.0 % |   21.62 μs |   70.21 ns |  280.83 |    1.41 μs |    4.56 ns |   18.25 | 15.39 x
   15 x 201 x 10 |    5.1 % |   28.34 μs |   18.52 ns |   74.10 |    3.07 μs |    2.00 ns |    8.02 |  9.24 x
   15 x 201 x 10 |   20.2 % |   59.96 μs |    9.86 ns |   39.42 |   13.38 μs |    2.20 ns |    8.79 |  4.48 x
   15 x 201 x 10 |   50.0 % |   91.14 μs |    6.05 ns |   24.19 |   27.01 μs |    1.79 ns |    7.17 |  3.37 x
   15 x 201 x 10 |   80.1 % |   62.19 μs |    2.57 ns |   10.30 |   42.55 μs |    1.76 ns |    7.04 |  1.46 x
   15 x 201 x 10 |   99.0 % |   52.18 μs |    1.75 ns |    6.99 |   52.20 μs |    1.75 ns |    7.00 |  1.00 x
   15 x 201 x 10 |  100.0 % |   53.80 μs |    1.78 ns |    7.14 |   49.31 μs |    1.64 ns |    6.54 |  1.09 x
 64 x 9 x 3 x 18 |    0.1 % |   18.82 μs |  553.41 ns | 2213.65 |    0.53 μs |   15.68 ns |   62.73 | 35.29 x
 64 x 9 x 3 x 18 |    1.0 % |   21.13 μs |   65.82 ns |  263.28 |    1.12 μs |    3.49 ns |   13.97 | 18.85 x
 64 x 9 x 3 x 18 |    5.1 % |   28.70 μs |   18.16 ns |   72.62 |    4.37 μs |    2.76 ns |   11.05 |  6.57 x
 64 x 9 x 3 x 18 |   20.1 % |   52.25 μs |    8.34 ns |   33.37 |   14.34 μs |    2.29 ns |    9.16 |  3.64 x
 64 x 9 x 3 x 18 |   50.0 % |   85.26 μs |    5.49 ns |   21.94 |   36.34 μs |    2.34 ns |    9.35 |  2.35 x
 64 x 9 x 3 x 18 |   80.2 % |   64.84 μs |    2.60 ns |   10.40 |   58.75 μs |    2.36 ns |    9.42 |  1.10 x
 64 x 9 x 3 x 18 |   99.0 % |   74.27 μs |    2.41 ns |    9.65 |   70.42 μs |    2.29 ns |    9.15 |  1.05 x
 64 x 9 x 3 x 18 |  100.0 % |   74.51 μs |    2.40 ns |    9.58 |   75.90 μs |    2.44 ns |    9.76 |  0.98 x

A few observations

  • The performance for the 100% selected case was substantially better than the 99% case for the one-dimensional tests. This effect appears to go away for the higher-dimensional tests.
  • For the 100% selected case in one and two dimensions, the Power9 system performs better than the Intel system, and performs similarly in the three- and four-dimensional cases. For anything less than 100% selected, the Power9 system performs worse (in the neighborhood of 2x).

#32

Thanks!

Looks like the new code is also good on Ryzen for realistic fill rates. The 5 cycles for the inner loop on Power9 don’t look so nice, but that is still better than the old code and not really bad.


#33

This is explained by a completely different code path being used for the 100% selected case (just returning all indices), and also different code paths for 1, 2 and 3+ dimensions. But indeed, that’s a much bigger difference than we’ve seen on other systems!