Idiomatic Benchmarking

I feel like every time I write a benchmark for a package I’m reinventing the wheel, with slight changes in how I list dependencies, how I organize the actual benchmarks, how I collect results, and how I end up displaying them or incorporating them into docs. I’d like to hear anecdotes, get links to packages, and find new tools I can use to address the following questions/problems and start writing better benchmarks.

  1. Reproducibility should be paramount, potentially across multiple languages. Is there a good solution for managing several languages’ manifest/requirements/lock files? Docker could be a solution, but in the one instance I’ve tried it there was a noticeable performance hit, and writing a container that uses both Julia and Python was surprisingly difficult.
  2. Should the emphasis be less on providing benchmarks for other users to run and more on running them in CI? The latter simplifies the packaging problem, since everything is hand-coded into the CI workflow, but I don’t know how reliable CI is for gauging performance; I’ve always relied on local tests.
  3. I typically run Julia vs. Python benchmarks and can leverage PyCall, but what about C++ code, or R code? (I’m aware of RCall; are people confident using it for benchmarking?)
  4. Every time I use BenchmarkTools I feel like I’m not taking full advantage of it. Are there any idiomatic examples of organizing a benchmark suite with BenchmarkGroups? (See the sketch after this list.)
  5. How does/should writing benchmarks change when the goal is comparing against other software as opposed to testing for regressions? Ideally both would be possible with the same tool. Are there any examples of the CI required to do this? Something I could see being typical: every, say, minor release gets a regression test, which additionally creates comparison benchmarks against the other software; if a particular PR is performance-related, a review comment could trigger a run, too.
  6. Tied into (5), how do you organize your benchmarks? Do you use modules? Are they part of the package, or only accessible by cloning the repository (e.g. a benchmarks/ folder)?
  7. I feel like it makes sense to collect all the benchmark results into CSV files and check them into git, then use @example blocks in the documentation to plot and display them. How does this compare to creating the plots within the benchmark script itself?
  8. Something that could also be cool is benchmarking numerical error (e.g. comparing against BigFloat), especially if it can be easily incorporated into the rest of the benchmarking tooling (a second sketch of this follows below).
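
For reference on 4 and 6, the kind of layout I have in mind is the PkgBenchmark convention of a benchmark/benchmarks.jl file defining a top-level SUITE. A minimal sketch (the group names and sizes are made up, and the package import is a placeholder):

```julia
# benchmark/benchmarks.jl: the file PkgBenchmark looks for by convention.
using BenchmarkTools
# using MyPackage   # the package being benchmarked would be loaded here

const SUITE = BenchmarkGroup()

# Nested groups keep related benchmarks together; tags let you filter subsets later.
SUITE["linalg"] = BenchmarkGroup(["arrays"])
SUITE["io"] = BenchmarkGroup(["parsing"])

for n in (10, 100, 1000)
    x = randn(n)
    # Interpolate setup data with $ so constructing it isn't timed.
    SUITE["linalg"]["sum-$n"] = @benchmarkable sum($x)
    SUITE["linalg"]["sort-$n"] = @benchmarkable sort($x)
end

SUITE["io"]["parse"] = @benchmarkable parse(Float64, "3.14159")

# To run locally:
#   tune!(SUITE)
#   results = run(SUITE; verbose = true)
#   BenchmarkTools.save("results.json", results)
```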
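
And for 8, what I picture is computing the error alongside the timings rather than through BenchmarkTools itself; a rough sketch, where myfun stands in for whatever kernel is being tested:

```julia
using BenchmarkTools

# Hypothetical function under test; replace with the real kernel.
myfun(x) = sum(xi^2 for xi in x)

# Relative error of the Float64 result against a BigFloat reference.
function relative_error(f, x::Vector{Float64})
    ref = f(big.(x))          # BigFloat "ground truth"
    val = f(x)
    return abs(big(val) - ref) / abs(ref)
end

x = randn(1000)

# Timing benchmark as usual...
time_trial = @benchmark myfun($x)

# ...and an accuracy "benchmark" recorded alongside it.
err = Float64(relative_error(myfun, x))
@show minimum(time_trial).time err
```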

Packages/Tools
Here’s a list of packages/tools related to benchmarking; I’ll update it as people mention them.


Regarding point 2, I very recently started using BenchmarkCI.jl (which I mentioned just a few hours ago on Zulip); see for example its use in Measurements.jl. In that Zulip thread we were also discussing the fact that GitHub runners aren’t completely reliable (sometimes I get totally unreasonable results), but at least I can run the regression benchmarks locally, and that seems to work decently well.
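
For the local runs it boils down to something like this (a sketch from memory, so check BenchmarkCI’s README for the exact keyword; origin/master stands for whatever your baseline branch is):

```julia
# Run from the package root; assumes the usual benchmark/benchmarks.jl defining SUITE.
using BenchmarkCI

# Benchmark the current working tree against the baseline branch,
# then print the PkgBenchmark judgement to the terminal.
BenchmarkCI.judge(baseline = "origin/master")
BenchmarkCI.displayjudgement()
```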


Something I’ve already noticed is that most of these packages seem focused on performance regression testing. For example, a SUITE of BenchmarkGroups is a useful structure for organizing the benchmarks, but going from that to, say, a plot in the package docs is not clear.
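
The closest I can picture is flattening the results by hand into something plottable, e.g. (a rough sketch, assuming a SUITE like the one above and that minimum times are the statistic you want):

```julia
using BenchmarkTools

results = run(SUITE; verbose = true)

# Flatten the nested BenchmarkGroup into (key-path, Trial) pairs and
# dump minimum time / memory as CSV rows that the docs can later plot.
open("benchmarks.csv", "w") do io
    println(io, "benchmark,min_time_ns,memory_bytes,allocs")
    for (keypath, trial) in BenchmarkTools.leaves(results)
        est = minimum(trial)
        println(io, join(keypath, "/"), ",", est.time, ",", est.memory, ",", est.allocs)
    end
end
```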

I’ve never used it for that purpose, but I was under the impression you could do that with PkgBenchmark.jl.
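
What I had in mind is roughly the following (an untested sketch; MyPkg is a placeholder, and it gives you markdown reports and raw results rather than plots):

```julia
using PkgBenchmark

# Run benchmark/benchmarks.jl for the current state of the package.
results = benchmarkpkg("MyPkg")

# Human-readable report, plus the raw results for later processing.
export_markdown("benchmark-results.md", results)
writeresults("benchmark-results.json", results)

# Compare two revisions, e.g. a feature branch against master.
judgement = judge("MyPkg", "my-feature-branch", "master")
export_markdown("judgement.md", judgement)
```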

I once set up tkf/TransducersBenchmarksReports.jl to report benchmarks with plots and tables using Documenter.jl, e.g. the GEMM page of TransducersBenchmarksReports.jl. I haven’t been using it much, though.

I’ve never used it for that purpose, but I was under the impression you could do that with PkgBenchmark.jl.

Unfortunately I don’t see that anywhere in the docs.

Regarding 3, I don’t think there are issues with using RCall and bench or similar; I’ve found that to work quite well in practice.
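
For concreteness, the pattern looks roughly like this (a sketch; timing through RCall includes a small round-trip overhead, so for a strictly fair comparison you can instead time inside R with bench::mark and pull the numbers back):

```julia
using BenchmarkTools, RCall

x = randn(10_000)
@rput x                     # copy the data into the embedded R session once

# Julia side
t_julia = @belapsed sum(abs2, $x)

# R side: the R code is re-evaluated on every sample; rcopy just pulls
# back the scalar result. This includes a small RCall round-trip cost.
t_r = @belapsed rcopy(R"sum(x^2)")

@show t_julia t_r
```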