Making @benchmark outputs statistically meaningful, and actionable

I see a huge disparity in the output of different runs of @benchmark, for the same code. See the following two outputs, made consecutively, with zero code changes and no restarting Julia:

BenchmarkTools.Trial: 161 samples with 1 evaluation.
 Range (min … max):  30.098 ms …  32.380 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     30.904 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.217 ms ± 632.639 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▅█▅█▂█▆ ▂                     ▃▅       ▂▃▅      
  ▄▄▁▄▅▄▅▄▇▄▅▄▇▅███████▅█▁▇▄▁▄▄▁▄▁▄▁▁▄▁▁▁▅▅▅▇▅██▇▇▄▅▇█▇███▅▁▅▇ ▄
  30.1 ms         Histogram: frequency by time         32.2 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

BenchmarkTools.Trial: 119 samples with 1 evaluation.
 Range (min … max):  40.943 ms … 44.543 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     41.448 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   42.238 ms ±  1.218 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃ ▂ ▅█▂                                                    
  ▆▅█▇█▆███▄▄▁▃▃▅▃▃▁▃▁▁▁▁▃▄▁▁▃▁▁▁▁▁▁▁▃▁▁▄▄▄▄▄▃▁▃▅▁▅▅▃▃▃▅▄▇▅▃▄ ▃
  40.9 ms         Histogram: frequency by time        44.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Note there is no garbage collection at play here.

There is a huge difference between the two outputs. There isn’t any overlap between the two histograms. In fact, there’s around a 30% difference between the slowest time in the first run and the fastest time in the second! The histograms are therefore meaningless: they don’t reflect the real distribution of timings.

This makes results from @benchmark lack actionability. If I want to know whether a small code change leads to faster code, @benchmark cannot reliably help me answer that. I might randomly get a faster run for code that is in general slower.

So how is this happening?

I believe that this issue is due to alignment between working memory and cache lines. Different runs of the code will use memory blocks at different offsets from the start of a cache line. This can have a large impact on cache hits/misses and therefore performance. Usually we have no control over this at runtime (though it’s possible to ensure that memory blocks have a particular alignment if needed).
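As a rough illustration (my own sketch, not a reliable reproducer of the problem), you can inspect where a fresh allocation lands relative to a 64-byte cache line; the offset varies between allocations and between sessions:

```julia
# Sketch: measure where an array's data starts relative to a 64-byte
# cache line. An offset of 0 means the data is cache-line aligned.
a = zeros(Float64, 1024)
offset = Int(UInt(pointer(a)) % 64)
println("offset into cache line: ", offset, " bytes")
```

Two arrays whose data start at different offsets can touch a different number of cache lines for the same logical access pattern, which is the mechanism I'm suggesting is at play.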

However, there’s a fix: we can make sure that the results from @benchmark reflect this variability. To do this, @benchmark would need to randomly change the offset of stack- and heap-allocated memory between evaluations of the code.

Indeed, we can even go further, and say whether the difference between two runs is statistically significant. For this we’d need a new macro, @bcompare.

About 3-4 years ago I watched a video of a great talk by a CS professor who explains this and has developed a benchmarking tool that accounts for it (not for Julia, but for command-line applications). I wish I could find it again, but I haven’t been able to work out the right keywords to find it on Google. If you know what I’m talking about, please share a link? Well worth a watch, here.


What is the MWE? We need that to give any meaningful answer.


Answer to what, Archie? The post consists of statements, not questions. Well, there is one question, about a link to a talk, but I don’t think that requires a MWE.

The code I’m benchmarking is reasonably complex, and reducing it down to a MWE would be a lot of effort. At this point it’s not clear what use that MWE would serve. Do you not believe me about the timings? Have you never seen something similar?

Some type of MWE would indeed be helpful here because although there are no code questions being asked, recommendations are being made to change/update/create new benchmarking tools. But before people agree that there is a problem, they need to be able to reproduce it themselves. I myself have seen somewhat varying timings, but usually I can trace the results to a “me problem” rather than a problem with BenchmarkTools.

Some questions I would like to test are: what if @benchmark is run 5 times in a row? What about after a fresh session restart? Does it change if you use randomized inputs? What about setting the seconds keyword to increase the number of runs? Do the results come out the same? If I can’t test these questions with a MWE, then I can’t tell if there is a problem unless I am somehow able to write my own code with unstable benchmarks.
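For example (a sketch only; `mycode` is a stand-in for whatever function is actually being benchmarked), the repeated-run and `seconds` questions could be checked like this:

```julia
using BenchmarkTools, Statistics

# `mycode` is a placeholder for the function under test.
for i in 1:5
    t = @benchmark mycode()          # a fresh trial each iteration
    println("run $i: ", median(t))   # compare medians across runs
end

# Lengthen the measurement budget (default is 5 seconds) to collect
# more samples per trial:
@benchmark mycode() seconds=30
```

If the medians cluster tightly within each trial but jump between trials, that would support the layout hypothesis over ordinary run-to-run noise.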

And it’s not about believing or not believing, it’s about figuring out where the issue lies before recommending code changes.


So I found the talk. Well worth a watch.

Two thoughts:

  1. I’ve found that benchmarks for short-duration functions
    can be tricky and mysteriously variable due to all the hardware,
    OS, and runtime things that can affect operations.

  2. I notice that the total times of the two benchmark runs are about
    the same, 4991 ms for the first and 4939 ms for the second, but the
    numbers of samples are 161 and 119 respectively. All of the difference
    between the two appears to come from the sample counts.

Maybe understanding why the number of samples are different
between runs could explain the timings or suggest a MWE?

Unfortunately not. @benchmark is simply trying to run evaluations for 5 seconds total. The number of samples is inversely proportional to the function evaluation time.


Have you tried running the benchmark repeatedly for an extended
time to collect overall variability?

Is it possible that the processor clock changed between the runs?

The whole point of @benchmark is that a single run shows the overall variability. That it doesn’t is precisely the problem I’m trying to highlight.

It isn’t.

Have you watched the video? Memory layout affects runtime. @benchmark doesn’t account for this. I’m not saying anything controversial.


This seems like an interesting proposition, despite a slightly confrontational tone :wink:
Do you know how to implement what you’re suggesting? I’m basically the sole maintainer of BenchmarkTools.jl but I’m not sure I’m knowledgeable enough about low-level memory stuff.


A habit of mine. Especially when people suggest or imply that I’ve simply made a mistake. Not that it’s impossible, mind.

My thought is to allocate a randomly sized (within the length of a cache line, usually 64/128 bytes) block on the stack, and another larger, but randomly sized, block on the heap between evaluations. This would be a simple hack.
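A first sketch of the heap half of this (my own untested assumption; the stack half would need compiler support, such as a variable-size alloca, which plain Julia doesn’t expose, so it is omitted here):

```julia
# Sketch: allocate a randomly sized block between evaluations so that
# subsequent heap allocations may land at different cache-line offsets.
# Caveat: Julia's pool allocator uses fixed size classes for small
# allocations, so this is a hypothesis to test, not a guaranteed fix.
function perturb_heap_layout!(keep::Vector{Any})
    push!(keep, Vector{UInt8}(undef, rand(1:64)))  # keep a live reference
    return nothing
end
```

The `keep` vector holds the padding blocks alive so the allocator can’t immediately reuse the space.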

However, I note that in the talk, Emery says (at 22:00) that their tool ‘Stabilizer’ plugs directly into the LLVM compiler, so it could potentially be used directly. This would be the ideal, since it does far more randomization. For example, if someone uses a test function that generates the data for benchmarking, the heap allocation layout will usually be the same, even with a slight offset. You really want something that jumbles everything up.

I could take a look at the BenchmarkTools code and try to do the former; see if it increases the variability as expected. I’ll also see if I can create a MWE. The problem is that the effect is hardware and OS dependent, so it might not reproduce in everyone’s environment.


The whole point of @benchmark is that a single run shows the overall variability. That it doesn’t is precisely the problem I’m trying to highlight.

I was wondering what the distribution of benchmark results was
compared to the per-run results and statistics to understand exactly
how often and by how much this difference occurs.

It isn’t.

Have you watched the video? Memory layout affects runtime. @benchmark doesn’t account for this. I’m not saying anything controversial.

I’m not sure @benchmark corrects for varying CPU clock speed so
I wanted to confirm that this was not the issue.

I understand how memory layout can affect runtime and don’t think you
have been saying anything controversial.

I don’t know how to get the quoted text with user name in the link.
Apologies for that.

Rereading the posts, I don’t think anyone suggested or implied mistakes were made. I commented on my experience doing benchmarks, but I was simply answering your question asking “Have you never seen something similar?”. To which the answer is “no”, I have not experienced variance that I couldn’t trace back to an issue of my own.

All the other posts are simply asking questions to obtain more knowledge about the problem since I/we do not have code that reproduces the issue, so we are relying on the information you provide to better understand what is happening. If the problem is statistical in nature (results change depending on circumstance), then it seems a sample size larger than one is necessary to properly understand it.

I’m also not saying, or implying, that you’re wrong; I certainly have no experience in this area to make that judgement. But when someone makes a post showing incorrect/inconsistent results, the first step taken by those trying to help is almost universally: try to reproduce the results. That is all we are asking.


Have you tried passing a setup block that re-allocates/copies/shuffles around your data?
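For instance (a sketch; `mycode` and `mydata` are placeholders for the actual function and input), something like:

```julia
using BenchmarkTools

# Re-allocate the input before every sample so its heap placement can
# vary; evals=1 makes the setup expression run once per sample.
@benchmark mycode(x) setup = (x = copy(mydata)) evals = 1
```

If the variability then shows up within a single trial, that would point at memory layout rather than something drifting between trials.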

It could be caching/locality effects, but it could also be so many things. Modern computer architectures (and operating systems) are wild. From thermal throttling to (surprisingly long-lived) branch prediction to caching to multithreading to hyperthreading… you’re performance fine-tuning in a hostile environment.


Providing a MWE is the first step in collectively solving a problem here. See the post pinned to the top of the forum when you first land here:

Someone asking for a MWE should not be taken as offensive at all - it’s the expected request on this forum if you haven’t included one in your first post.

Without an MWE you are asking everyone to reproduce your work by guessing, instead of diving in and trying to understand your problem straight away.

You will get way more help and attention on any topic if you have a MWE.

julia> f() = sleep(rand(1:10))
f (generic function with 1 method)

julia> using BenchmarkTools

julia> @benchmark f()
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 10.008 s (0.00% GC) to evaluate,
 with a memory estimate of 112 bytes, over 4 allocations.

julia> @benchmark f()
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  2.001 s … 10.008 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.004 s             ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.004 s ±  5.662 s  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                      █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2 s           Histogram: frequency by time          10 s <

 Memory estimate: 112 bytes, allocs estimate: 4.

Unless we know something about the code being benchmarked, the results on their own are not surprising.


I don’t think you did say I was wrong. Or that you implied it. Nothing I said implied that. I can see why you inferred it. It’s easy to infer things that aren’t true.

Again, I never said anybody said anything offensive here.

I did say that an MWE would be tricky, for two different reasons. Did you notice?

My results are different from this. And you do know something about the code: 1. The code was run twice in succession. 2. It doesn’t allocate any memory.

I can tell you more:
3. It’s pure computation. No disk IO, no network IO, no screen output.

I would think that would make the results surprising.

Yes! And we can account for most of them in the benchmark :smile: if we want to.

I will look into your setup block suggestion.