[ANN] BenchmarkHistograms.jl

BenchmarkHistograms.jl is a simple package that provides a UnicodePlots.jl-powered show method for BenchmarkTools’ @benchmark. The idea was originally discussed in BenchmarkTools.jl#180.

BenchmarkHistograms works by exporting its own @benchmark macro, which outputs a BenchmarkHistogram object holding the results (obtained, in turn, from BenchmarkTools.@benchmark); the custom show method is defined for this wrapper object. In other words, BenchmarkHistograms does not commit type piracy; it simply provides an alternative @benchmark.
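
A minimal sketch of that wrapper pattern (hypothetical names and details; the package’s actual internals may differ):

import BenchmarkTools  # import rather than `using`, so our @benchmark does not clash

# Hypothetical sketch: hold the BenchmarkTools result in our own type,
# so we can define `show` for it without piracy.
struct BenchmarkHistogram
    trial::BenchmarkTools.Trial  # the raw results from BenchmarkTools
end

macro benchmark(args...)
    # Delegate the benchmarking itself to BenchmarkTools and wrap the result;
    # `esc` passes the user's expressions through unchanged.
    esc(:(BenchmarkHistogram(BenchmarkTools.@benchmark $(args...))))
end

function Base.show(io::IO, ::MIME"text/plain", b::BenchmarkHistogram)
    # The real method draws a histogram of b.trial.times plus the usual
    # summary statistics; this is just a placeholder.
    println(io, "histogram of ", length(b.trial.times), " samples would go here")
end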

For ease of use, BenchmarkHistograms re-exports all the other exports of BenchmarkTools. So one can simply call using BenchmarkHistograms instead of using BenchmarkTools, and the only difference will be that @benchmark’s show method will plot a histogram in addition to displaying the usual values.

Example

The following is a simple example from the README, designed to show that, for some benchmarks, summary statistics alone can miss important information.

julia> @benchmark 5 ∈ v setup=(v = sort(rand(1:10_000, 10_000)))

samples: 3192; evals/sample: 1000; memory estimate: 0 bytes; allocs estimate: 0
                       β”Œ                                        ┐ 
      [   0.0,  500.0) ─▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2036   
      [ 500.0, 1000.0) ─ 0                                        
      [1000.0, 1500.0) ─ 0                                        
   ns [1500.0, 2000.0) ─ 0                                        
      [2000.0, 2500.0) ─ 0                                        
      [2500.0, 3000.0) ─ 0                                        
      [3000.0, 3500.0) ─▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1156                  
                       β””                                        β”˜ 
                                        Counts
min: 1.875 ns (0.00% GC); mean: 1.141 ΞΌs (0.00% GC); median: 4.521 ns (0.00% GC); max: 3.315 ΞΌs (0.00% GC).

Here we see a bimodal distribution. When 5 is indeed in the vector, we find it very quickly, in the 0-500 ns range (thanks to sort, which places it at the front). When 5 is not present, we need to check every entry to be sure, and we end up in the 3000-3500 ns range.
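
The two modes are exactly what you’d expect from a linear scan with an early exit; here is a simplified sketch (not Base’s actual in implementation) of where they come from:

# Simplified sketch of a linear membership test, just to illustrate the modes.
function linear_in(x, v)
    for y in v
        y == x && return true  # early exit: the fast ~0-500 ns mode
    end
    return false               # scanned everything: the slow ~3000-3500 ns mode
end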

Happy benchmarking!

49 Likes

very cool!

1 Like

julia> @benchmark bevalpoly!($y, $x)
samples: 10000; evals/sample: 200; memory estimate: 0 bytes; allocs estimate: 0
                     β”Œ                                        ┐
      [400.0, 410.0) ─▇▇▇▇▇▇▇▇▇▇▇ 2193
      [410.0, 420.0) ─▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6693
      [420.0, 430.0) ─▇▇▇▇ 826
      [430.0, 440.0) ─ 91
      [440.0, 450.0) ─ 75
   ns [450.0, 460.0) ─ 55
      [460.0, 470.0) ─ 33
      [470.0, 480.0) ─ 27
      [480.0, 490.0) ─ 5
      [490.0, 500.0) ─ 1
      [500.0, 510.0) ─ 0
      [510.0, 520.0) ─ 1
                     β””                                        β”˜
                                      Counts
min: 400.945 ns (0.00% GC); mean: 416.129 ns (0.00% GC); median: 419.205 ns (0.00% GC); max: 515.420 ns (0.00% GC).

julia> @benchmark evalpoly!($y, $x)
samples: 10000; evals/sample: 200; memory estimate: 0 bytes; allocs estimate: 0
                     β”Œ                                        ┐
      [400.0, 450.0) ─▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 9873
      [450.0, 500.0) ─ 116
      [500.0, 550.0) ─ 2
   ns [550.0, 600.0) ─ 8
      [600.0, 650.0) ─ 0
      [650.0, 700.0) ─ 0
      [700.0, 750.0) ─ 0
      [750.0, 800.0) ─ 1
                     β””                                        β”˜
                                      Counts
min: 400.950 ns (0.00% GC); mean: 416.991 ns (0.00% GC); median: 419.235 ns (0.00% GC); max: 777.840 ns (0.00% GC).

Based on min, mean, and median times, these two performed almost exactly the same.
But the second has a scattering of samples in the 550-600 ns range and a much higher maximum, which places the two histograms on different scales and makes visual comparison difficult.

If we have a huge number of samples (e.g., 10_000), it may be nice to trim off some of the extremes. That would also help us see a finer breakdown of the 400-450 ns range in the example above.

1 Like

Good point, I’ve noticed this issue too. Any ideas for what the methodology for selecting outliers should be? My first guess would be that the gap between the median and the minimum is a decent reference for the spread (based partly on https://juliaci.github.io/BenchmarkTools.jl/dev/manual/#Which-estimator-should-I-use?), and if there are a small number of points that exceed the median plus some multiple of that gap, then maybe those are outliers. Maybe there is something more principled, though.
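
In code, that first guess might look something like this (a rough sketch; flag_outliers, k, and max_frac are made-up names and tuning knobs):

using Statistics: median

# Sketch of the guess above; `k` and `max_frac` are made-up tuning knobs.
function flag_outliers(times; k = 10, max_frac = 0.01)
    m = median(times)
    gap = m - minimum(times)            # rough proxy for the spread
    cutoff = m + k * gap
    flagged = findall(t -> t > cutoff, times)
    # Only call them outliers if there are few of them.
    return length(flagged) <= max_frac * length(times) ? flagged : Int[]
end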

For comparing two benchmarks, I think there is probably something better we could do than just removing outliers, like plotting the two distributions side-by-side with the same bins, or even on the same plot. I’m not sure what the right API for that would be; BenchmarkTools uses judge for comparing estimators (min/median/etc.) and outputs invariant, improvement, or regression along with the % change. We could add a judge method for BenchmarkHistogram objects, but I’m not sure we’d want to actually make a judgement so much as show a comparison plot, so maybe that’s not the right function to use.
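
For reference, the existing estimator-based comparison looks like this (judge and median are BenchmarkTools’ real API; the two trials here are just an arbitrary illustration):

using BenchmarkTools

t_target = @benchmark sum($(rand(1000)))
t_baseline = @benchmark sum($(rand(1000)))

# Compare the median estimates; returns a TrialJudgement that prints as
# invariant, improvement, or regression, along with the % change.
judge(median(t_target), median(t_baseline))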

2 Likes

BenchmarkHistograms v0.2 is out and now collects possible β€œoutliers” into the last bin. E.g. using the example

f() = sum((sin(i) for i in 1:round(Int, 1000 + 100*randn())))

@benchmark f()

we get

samples: 10000; evals/sample: 4; memory estimate: 0 bytes; allocs estimate: 0
ns

 (7290.0  - 7570.0 ]  β–Ž18
 (7570.0  - 7850.0 ]  β–ˆβ–Ž94
 (7850.0  - 8140.0 ]  β–ˆβ–ˆβ–ˆβ–ˆβ–Œ340
 (8140.0  - 8420.0 ]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž1091
 (8420.0  - 8700.0 ]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š1902
 (8700.0  - 8980.0 ]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ2316
 (8980.0  - 9260.0 ]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ1965
 (9260.0  - 9540.0 ]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹1282
 (9540.0  - 9820.0 ]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹580
 (9820.0  - 10100.0]  β–ˆβ–ˆβ–ˆβ–Œ262
 (10100.0 - 10390.0]  β–ˆ72
 (10390.0 - 10670.0]  β–Œ37
 (10670.0 - 10950.0]  ▍22
 (10950.0 - 11230.0]  ▏9
 (11230.0 - 29540.0]  β–Ž10

                  Counts

min: 7.292 ΞΌs (0.00% GC); mean: 8.923 ΞΌs (0.00% GC); median: 8.886 ΞΌs (0.00% GC); max: 29.541 ΞΌs (0.00% GC).

You can see the last bin is much larger than the others, spanning from 11230.0 ns to 29540.0 ns. Before (replotting the exact same benchmark times with the old scheme), we would have had:

samples: 10000; evals/sample: 4; memory estimate: 0 bytes; allocs estimate: 0
                         β”Œ                                        ┐ 
      [ 6000.0,  8000.0) ─▇ 218                                     
      [ 8000.0, 10000.0) ─▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 9552   
      [10000.0, 12000.0) ─▇ 223                                     
      [12000.0, 14000.0) ─ 2                                        
      [14000.0, 16000.0) ─ 3                                        
   ns [16000.0, 18000.0) ─ 0                                        
      [18000.0, 20000.0) ─ 1                                        
      [20000.0, 22000.0) ─ 0                                        
      [22000.0, 24000.0) ─ 0                                        
      [24000.0, 26000.0) ─ 0                                        
      [26000.0, 28000.0) ─ 0                                        
      [28000.0, 30000.0) ─ 1                                        
                         β””                                        β”˜ 
                                          Counts
min: 7.292 ΞΌs (0.00% GC); mean: 8.923 ΞΌs (0.00% GC); median: 8.886 ΞΌs (0.00% GC); max: 29.541 ΞΌs (0.00% GC).

You can see the huge outlier at 30k ns forces us to have many empty bins, while the bulk of the roughly Gaussian distribution gets only 3 bins.

There is a setting that controls this, BenchmarkHistograms.OUTLIER_QUANTILE, which defaults to 0.999. It works like this: if the 99.9th percentile is more than two bins away from the true maximum, the samples beyond it are considered outliers, and all of those bins are squashed into a single bin. In this case, for example, the 99.9th percentile is 11229.33 ns.
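
Schematically, the rule is something like this (a sketch, not the package’s exact code):

using Statistics: quantile

# Schematic sketch of the binning rule; `edges` are equally spaced
# candidate bin edges covering `times`.
function squash_outlier_bins(times, edges; q = 0.999)  # cf. OUTLIER_QUANTILE
    cut = quantile(times, q)
    binwidth = edges[2] - edges[1]
    if maximum(times) - cut > 2 * binwidth
        # Keep the regular bins up to the quantile, then one catch-all
        # final bin running out to the true maximum.
        edges = [filter(e -> e <= cut, edges); maximum(times)]
    end
    return edges
end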

We also dropped the UnicodePlots dependency in favor of some custom histogram code written by @brenhinkeller in https://github.com/JuliaCI/BenchmarkTools.jl/pull/180#issuecomment-711128281, which made it easier to customize this special binning behavior.

17 Likes

Does this package work with all versions of Julia from 1.0 to 1.6?

Yep!

1 Like

BenchmarkTools v1.1 provides fancy histogram printing by default with @benchmark (thanks to BenchmarkTools#217 by @tecosaur), so BenchmarkHistograms is no longer needed and will probably not get any more development or releases. Thanks for the interest, and Pkg.update("BenchmarkTools")!

julia> @benchmark 5 ∈ v setup=(v = sort(rand(1:10_000, 10_000)))
BenchmarkTools.Trial: 3125 samples with 1000 evaluations.
 Range (min … max):  2.500 ns … 6.578 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     5.333 ns             β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.160 ΞΌs Β± 1.528 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ                                                    β–‡ β–‚
  β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–ˆβ–ˆβ–‡ β–ˆ
  2.5 ns      Histogram: log(frequency) by time     3.36 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Edit: I realised some people credit me with helping with the upstream BenchmarkTools functionality; that was actually an entirely unrelated effort! I think it looks great, though, and am happy to retire BenchmarkHistograms (but I can’t claim any credit for it).

16 Likes