Rule of thumb to estimate theoretical lower bound on benchmarks?

I have a somewhat complex inner loop (I can share it if needed, but I am looking for general rules here) that benchmarks at 200-800 ns with no allocations. A range makes sense here: the time depends on which conditionals are randomly hit on each iteration.

I am very happy with this, but would like to roughly estimate how much performance I may be leaving on the table.

To start, I benchmarked some very basic functions:

Simple Functions
function plusrand(x)
    x + rand()
end

function plusrand2(x, y)
    return (x + rand(), y + rand())
end

function plusone(x)
    x + 1
end

function plusone2(x, y)
    return (x + 1, y + 1)
end
julia> @benchmark plusrand(1)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.980 ns … 11.150 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     3.090 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.137 ns Β±  0.193 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

          β–‚β–„ β–†β–ˆβ–„ ▁▁▁                                          
  β–‚β–β–‚β–β–ƒβ–„β–†β–‚β–ˆβ–ˆβ–β–ˆβ–ˆβ–ˆβ–β–ˆβ–ˆβ–ˆβ–ƒβ–‡β–„β–β–ƒβ–ƒβ–ƒβ–β–…β–†β–†β–‚β–…β–„β–β–„β–„β–ƒβ–‚β–ƒβ–ƒβ–β–ƒβ–ƒβ–ƒβ–β–ƒβ–‚β–‚β–β–‚β–ƒβ–β–„β–…β–„β–β–ƒβ–‚β–‚ β–ƒ
  2.98 ns        Histogram: frequency by time         3.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark plusrand2(1, 10)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.878 ns … 20.982 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     7.938 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   8.056 ns Β±  0.326 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

    β–ˆ                                                         
  β–…β–β–ˆβ–‡β–‚β–†β–‚β–…β–‚β–ƒβ–…β–‚β–β–‚β–ƒβ–„β–‚β–‚β–‚β–β–β–β–β–‚β–β–β–β–β–β–β–‚β–‚β–‚β–„β–„β–‚β–β–β–β–β–‚β–‚β–β–‚β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–‚ β–‚
  7.88 ns        Histogram: frequency by time        9.01 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark plusone(1)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.220 ns … 5.530 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     1.240 ns             β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.278 ns Β± 0.067 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

    β–†  β–ˆ                    β–‚      β–†                         
  β–…β–β–ˆβ–‚β–β–ˆβ–‚β–β–ƒβ–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–ˆβ–‚β–β–β–†β–β–β–ˆβ–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–‚β–β–‚ β–ƒ
  1.22 ns        Histogram: frequency by time       1.41 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark plusone2(1, 10)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.240 ns … 8.301 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     1.330 ns             β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.324 ns Β± 0.120 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                 β–†            β–ˆ                              
  β–ƒβ–β–β–„β–β–β–†β–β–β–β–β–β–ƒβ–β–β–ˆβ–‚β–β–‚β–‚β–β–β–ƒβ–β–β–ƒβ–β–β–ˆβ–β–β–„β–β–β–‚β–‚β–β–β–ƒβ–β–β–β–β–β–ƒβ–β–β–ƒβ–β–β–…β–β–β–„β–‚β–β–‚ β–‚
  1.24 ns        Histogram: frequency by time       1.42 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

From this I would say my loop could not possibly be faster than roughly 1 ns, is more likely over 10 ns, and is most likely at least a few multiples of that. Also, looking at min/max, even the range for plain integer addition spans several nanoseconds (1.2-5.5 ns). So 5-10 ns appears to set a kind of noise floor.

Thus I can’t be leaving more than roughly 20-80x in performance on the table, and most likely it’s more like 2-10x.

Is there some rule based on the number of operations (e.g. additions, multiplications, conditionals) that could be used to tighten this bound?

This is meant to be only a rough estimate, to maybe one order of magnitude of accuracy (i.e. 10 ns vs. 100 ns).

Seems like it would be interesting to estimate absolute, rather than only relative, performance.

Assuming it is still accurate, this looks relevant:

My takeaway is that each operation takes ~1-10 ns, so any moderately complicated loop will require ~100 ns. Then any benchmark under a microsecond means you are within an order of magnitude of optimal. Roughly.
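As a rough sanity check on that kind of estimate, here is a minimal back-of-envelope sketch. The operation count (30) and per-operation cost range are assumed numbers for illustration, not measurements of my actual loop:

# Back-of-envelope lower-bound estimate (assumed numbers, for illustration only)
ops_per_iteration = 30                      # hypothetical count of adds/branches/assignments per iteration
ns_per_op_low, ns_per_op_high = 1.0, 10.0   # rough per-operation cost taken from the simple benchmarks above

lower_ns = ops_per_iteration * ns_per_op_low    # ~30 ns
upper_ns = ops_per_iteration * ns_per_op_high   # ~300 ns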

2 Likes

I’d say the furthest you can go with any certainty is the native code (@code_native works, assuming the compiler specialized over all arguments in the call). As the Infographics link shows, the same machine’s performance on the “same” instruction depends on context. Something relevant it doesn’t mention is underclocking and overclocking, i.e. changes in how long a CPU cycle takes. Benchmarks slow down significantly when my operating system decides to conserve power, which is reasonable when overheating or on low battery, but it also happens if I do something as benign as unplug a fully charged laptop (shown below). Time isn’t simply correlated with CPU cycles, and cycles aren’t simply correlated with instructions, so the most that published benchmarks can do is report system details.

julia> @benchmark sin($2.5)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  6.700 ns … 100.500 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     7.200 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   7.390 ns Β±   3.073 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

      β–ˆ β–†
  β–‡β–„β–†β–β–ˆβ–ƒβ–ˆβ–β–ˆβ–„β–ˆβ–β–‚β–ƒβ–ƒβ–β–‚β–‚β–‚β–β–‚β–‚β–‚β–β–‚β–‚β–‚β–β–β–β–β–β–β–‚β–β–β–β–β–‚β–β–β–β–‚β–β–‚β–β–‚β–β–β–β–β–β–‚β–β–β–β–‚β–β–‚ β–ƒ
  6.7 ns          Histogram: frequency by time        11.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sin($2.5) # unplugged a fully charged laptop
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  11.411 ns … 361.361 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     28.228 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   32.009 ns Β±  12.223 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                  β–†β–ˆβ–ˆβ–†β–‚                 ▁▁▁▁▂▃▂▂▂▁▂▂▁          β–‚
  β–†β–‡β–‡β–ˆβ–‡β–„β–„β–‚β–„β–„β–„β–†β–…β–†β–…β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–†β–†β–‡β–‡β–‡β–†β–„β–…β–…β–„β–„β–…β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–†β–†β–†β–† β–ˆ
  11.4 ns       Histogram: log(frequency) by time      65.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sin($2.5) # 2 seconds after plugging back in
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  6.700 ns … 129.900 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     7.400 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   9.094 ns Β±   6.710 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–ˆβ–…β–‚β–ƒ  ▁▃▃▁   ▁▂▁                                           β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–„β–„β–β–ƒβ–β–ƒβ–…β–…β–„β–β–ƒβ–…β–„β–„β–…β–ƒβ–ƒβ–„β–…β–β–β–β–„β–„β–β–β–β–„β–„β–…β–…β–†β–†β–†β–†β–…β–†β–‡β–‡ β–ˆ
  6.7 ns       Histogram: log(frequency) by time      44.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
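To make the @code_native point above concrete, here is a minimal sketch using plusone from the OP (the printed assembly depends on Julia version and CPU, so output is omitted):

using InteractiveUtils   # provides @code_native (already loaded in the REPL)

plusone(x) = x + 1

# Print the machine code the compiler generated for plusone(::Int);
# reading/counting these instructions is about as far as a static estimate can go.
@code_native debuginfo=:none plusone(1)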

PS: I’d liberally $-interpolate non-const inputs in BenchmarkTools so they’re not treated as constants by the compiler; that could throw things way off.
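As a sketch of what that looks like (x here is a hypothetical non-const global):

using BenchmarkTools

x = 2.5   # non-const global

# Without interpolation the benchmarked expression references the global x,
# so every evaluation pays for the untyped global lookup:
@benchmark sin(x)

# With $-interpolation the current value of x is spliced into the benchmark
# at definition time, so the call is measured as if x were a local of known type:
@benchmark sin($x)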

2 Likes

Thanks, I had assumed that typing the argument directly meant it was treated as a constant rather than a variable (and so didn’t require interpolation).

If so, I’ll have to check some other benchmarks later.

I think tabulating the number of assembly operations, at least by itself, doesn’t really get at the answer, because the algorithm itself could possibly be improved.

But it should be possible to look at the problem and say, e.g., “any solution will require at least checking three conditionals, incrementing two values, and assigning two values”.

Like comparing to the 1 + 1 benchmark in the OP. I don’t know how to prove it, but I’m still confident an iteration of my loop can’t be faster than doing 1 + 1.
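For instance, one could benchmark a stripped-down skeleton of just that minimal work as a concrete floor. The function below is a hypothetical stand-in for such a skeleton, not my actual loop body:

using BenchmarkTools

# Hypothetical skeleton: three comparisons, two increments, two assignments,
# roughly the minimum work any correct solution would have to do per iteration.
function skeleton(a, b, c, x, y)
    if a < b
        x += 1
    end
    if b < c
        y += 1
    end
    if a < c
        x, y = y, x
    end
    return (x, y)
end

@benchmark skeleton($1, $2, $3, $0, $0)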

Also, as you say, the optimal code may depend on the computing context (CPU, etc.). So the estimate would be qualified with “in the current computing environment”.

One can usually bound the complexity of an algorithm. One of my favorite examples is matrix-matrix multiplication, whose complexity is lower bounded by Ω(N^2) (you at least have to touch every entry). The best theoretical algorithms are around O(N^{2.37}), the best “practical” algorithm (Strassen) is about O(N^{2.81}), and almost all implementations use the standard O(N^3) algorithm because it runs very well on computers.
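To get a feel for the constant-factor gap between “same complexity” implementations, here is a sketch comparing a naive triple-loop O(N^3) multiply against the BLAS-backed mul! (same asymptotics, very different wall time):

using LinearAlgebra, BenchmarkTools

# Naive O(N^3) multiply: same asymptotic complexity as BLAS gemm,
# but without blocking/SIMD it is typically one to two orders of magnitude slower.
function naive_mul!(C, A, B)
    fill!(C, zero(eltype(C)))
    @inbounds for j in axes(B, 2), k in axes(A, 2), i in axes(A, 1)
        C[i, j] += A[i, k] * B[k, j]
    end
    return C
end

N = 512
A, B, C = rand(N, N), rand(N, N), zeros(N, N)

@btime naive_mul!($C, $A, $B)   # hand-written loops
@btime mul!($C, $A, $B)         # BLAS gemm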

But translating an operation count, or even a fully specified set of calculations, into wall time is often extremely difficult. In addition to the automatic under/overclocking discussed above, there are many different ways a processor can structure calculations. Note that processor clocks have only increased from about 3 GHz 20 years ago to around 5 GHz now, but computational ability has increased much more than clock speed alone would indicate (even on a single core). A significant part of this comes down to SIMD and dark magic that is difficult (for a layperson) to predict.

And despite my processor only operating at 3.8-5GHz, it can do 34B Float64 FMA (a combination of multiply and add) instructions per second on a single thread (measured by using LinearAlgebra; BLAS.set_num_threads(1); peakflops()/2, with the /2 because it counts FMA as 2 operations even though it’s more fair to treat it as 1). In other words, it’s averaging around 7 float calculations per clock cycle. If I switch to peakflops(;eltype=Float32)/2, this throughput basically doubles to 70B FMA/second (because Float32 can get double the SIMD benefit of Float64, though SIMD isn’t applicable to many calculations).
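The measurement described above, as a runnable sketch (the numbers will of course differ by machine):

using LinearAlgebra

BLAS.set_num_threads(1)   # single thread, so the result reflects one core

# peakflops() times a large BLAS matrix multiply and reports flops/second,
# counting each FMA as two operations; divide by 2 to count FMAs instead.
fma64_per_sec = peakflops() / 2
fma32_per_sec = peakflops(; eltype=Float32) / 2   # Float32: twice the SIMD width

println("Float64 FMA/s: ", fma64_per_sec)
println("Float32 FMA/s: ", fma32_per_sec)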

However, this 34B (or 70B Float32) number is pretty much a best-case situation (though if your processor supports 512-bit vector operations, you could get an additional 2x boost over mine, which only has 256-bit). It requires that the calculation is structured so that SIMD instructions apply, that it isn’t limited by memory speed, and that sequential data dependencies aren’t too tight.

Throughput on other calculations is usually tremendously lower. For example, my CPU benchmarks sin(1e0) at around 4.2 ns, and by my rough count that uses a little under 20 flops (4-5 flops per ns). And sin is a pretty arithmetically dense function, so “typical” calculations may be worse still. Branches, memory bandwidth, data dependencies, and other things can drop throughput precipitously.
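That per-ns figure is easy to reproduce as a sketch; the ~20-flop count for sin is the rough estimate quoted above, not something measured:

using BenchmarkTools

t_ns = @belapsed(sin($(Ref(1.0))[])) * 1e9   # Ref trick keeps the call from being constant-folded

approx_flops = 20                 # rough flop count for the sin kernel (assumed, from the estimate above)
flops_per_ns = approx_flops / t_ns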

A fantastic post on low-level performance tips can be found here.


So the short answer is β€œit takes some expertise to predict the performance of a block of code.” There isn’t really a nice rule of thumb and it depends heavily on what sort of calculation is being done.

If you have an implementation that β€œseems” slower than you think it should be, start by profiling it to find where the time is spent. Once you have isolated a small (<30 lines, ideally) slow section of code or MWE, check the performance tips and then make a post here asking for help optimizing it. If your example is small and self-contained, you’ll often get a decent number of responses.
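A minimal profiling sketch (run_inner_loop is a hypothetical stand-in for the code being investigated):

using Profile

# Collect samples over many iterations so the profiler has something to see.
@profile for _ in 1:10^5
    run_inner_loop()
end

Profile.print(format=:flat, sortedby=:count)   # flat listing, hottest lines by sample count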

1 Like