I see a huge disparity in the output of different runs of `@benchmark` for the same code. See the following two outputs, produced consecutively, with zero code changes and without restarting Julia:
```
BenchmarkTools.Trial: 161 samples with 1 evaluation.
Range (min … max): 30.098 ms … 32.380 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 30.904 ms ┊ GC (median): 0.00%
Time (mean ± σ): 31.217 ms ± 632.639 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
  ▅█▅█▂█▆ ▂ ▃▅ ▂▃▅
▄▄▁▄▅▄▅▄▇▄▅▄▇▅███████▅█▁▇▄▁▄▄▁▄▁▄▁▁▄▁▁▁▅▅▅▇▅██▇▇▄▅▇█▇███▅▁▅▇ ▄
30.1 ms          Histogram: frequency by time          32.2 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
```
```
BenchmarkTools.Trial: 119 samples with 1 evaluation.
Range (min … max): 40.943 ms … 44.543 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 41.448 ms ┊ GC (median): 0.00%
Time (mean ± σ): 42.238 ms ± 1.218 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
  ▃ ▂ ▅█▂
▆▅█▇█▆███▄▄▁▃▃▅▃▃▁▃▁▁▁▁▃▄▁▁▃▁▁▁▁▁▁▁▃▁▁▄▄▄▄▄▃▁▃▅▁▅▅▃▃▃▅▄▇▅▃▄ ▃
40.9 ms          Histogram: frequency by time          44.4 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
```
Note there is no garbage collection at play here.
There is a huge difference between the two outputs, and no overlap at all between the two histograms. In fact, there is more than a 25% gap between the slowest time in the first run (32.380 ms) and the fastest time in the second (40.943 ms). The histograms are therefore misleading: they do not reflect the real distribution of timings.
This makes results from `@benchmark` hard to act on. If I want to know whether a small code change leads to faster code, `@benchmark` cannot reliably answer that: I might randomly get a faster run for code that is, in general, slower.
So how is this happening?
I believe that this issue is due to alignment between working memory and cache lines. Different runs of the code will use memory blocks at different offsets from the start of a cache line. This can have a large impact on cache hits/misses and therefore performance. Usually we have no control over this at runtime (though it’s possible to ensure that memory blocks have a particular alignment if needed).
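To make the effect concrete, here is a minimal sketch (my own illustration, nothing from BenchmarkTools) of how one could allocate a vector at a chosen byte offset from a cache-line boundary, assuming 64-byte cache lines; the helper name `offset_vector` is made up for this example:

```julia
# Sketch: over-allocate with Libc.malloc, round up to a 64-byte cache line,
# then add the requested offset before wrapping the memory in a Vector.
function offset_vector(::Type{T}, n::Integer, offset_bytes::Integer) where {T}
    cacheline = 64
    nbytes = n * sizeof(T)
    base = Libc.malloc(nbytes + cacheline + offset_bytes)
    base == C_NULL && throw(OutOfMemoryError())
    addr = UInt(base)
    aligned = (addr + cacheline - 1) & ~UInt(cacheline - 1)  # next cache-line boundary
    ptr = Ptr{T}(aligned + offset_bytes)                     # shift by the chosen offset
    # own = false: we free `base` ourselves with Libc.free when done.
    return unsafe_wrap(Vector{T}, ptr, n; own = false), base
end

# Usage: the same data, starting either exactly on a cache line or 8 bytes in.
v0, p0 = offset_vector(Float64, 10^6, 0)
v8, p8 = offset_vector(Float64, 10^6, 8)
# ... run @benchmark sum($v0) and @benchmark sum($v8) here ...
Libc.free(p0); Libc.free(p8)
```

Benchmarking the same kernel on `v0` versus `v8` shows whether the offset alone changes the timings on a given machine.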
However, there is a fix: we can make `@benchmark` report results that reflect this variability. To do this, it would need to randomly change the offset of stack- and heap-allocated memory between evaluations of the code.
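As a rough sketch of the idea (my own, not how BenchmarkTools works), one could time a kernel repeatedly and insert a throwaway allocation of random size before each evaluation, so that subsequent allocations land at varying addresses. Whether this actually perturbs Julia's allocator enough is an assumption, and `benchmark_with_random_offsets` is a made-up name:

```julia
using Statistics

# Sketch: perturb heap layout with a random-size padding allocation before
# each evaluation, then time the kernel on freshly set-up data.
function benchmark_with_random_offsets(kernel, setup; samples = 100)
    times = Float64[]
    for _ in 1:samples
        pad = Vector{UInt8}(undef, rand(0:4096))   # random-size padding allocation
        GC.@preserve pad begin
            data = setup()                         # this allocation may now be shifted
            push!(times, @elapsed kernel(data))
        end
    end
    return times
end

# Usage sketch: spread of timings for summing a freshly allocated vector each time.
times = benchmark_with_random_offsets(sum, () -> rand(10^6))
println(median(times), " ± ", std(times))
```

This only nudges heap allocations; randomising stack offsets and the layout chosen by the system allocator needs lower-level support, which is why this really belongs inside the benchmarking tool rather than in user code.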
Indeed, we could even go further and report whether the difference between two runs is statistically significant. For this we would need a new macro, say `@bcompare`.
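For illustration, such a comparison could be built on an ordinary nonparametric test over the two samples of timings. The sketch below uses the HypothesisTests.jl package and a Mann-Whitney U test, which is just one reasonable choice on my part, not a specification of what `@bcompare` should do:

```julia
using HypothesisTests, Statistics  # assumes HypothesisTests.jl is installed

# Sketch: decide whether two sets of per-sample timings (in seconds) differ
# significantly, using a Mann-Whitney U (rank-sum) test.
function compare_timings(times_a, times_b; alpha = 0.05)
    p = pvalue(MannWhitneyUTest(times_a, times_b))
    verdict = p < alpha ? "significant" : "not significant"
    println("median A = ", median(times_a), " s, median B = ", median(times_b), " s")
    println("difference is ", verdict, " (p = ", p, ")")
    return p
end
```

Note that such a test is only meaningful if the offset randomisation above is in place: without it, each run can be internally tight but shifted as a whole, and the test would happily declare a spurious difference.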
About 3-4 years ago I watched a video of a great talk by a CS professor who explains this and has developed a benchmarking tool that accounts for it (not for Julia, but for command-line applications). I wish I could find it again, but I haven't been able to work out the right keywords to find it on Google. If you know what I'm talking about, please share a link here; it's well worth a watch.