I feel like every time I write a benchmark for a package I’m reinventing the wheel, with slight changes in how I list dependencies, how I organize the actual benchmarks, how I collect results, and how I display them or incorporate them into docs. I’d like to hear anecdotes, get some links to packages, and learn about new tools I can use to address the following questions/problems and start writing better benchmarks.
- Reproducibility should be paramount, potentially across multiple languages. Is there a good solution for managing multiple languages’ manifest/requirements/lock files? Docker could be a solution, but in the one instance I’ve tried it there was a noticeable performance hit, and writing a container that uses both Julia and Python was surprisingly difficult.
- Should the emphasis be less on providing benchmarks for other users to run and more on running them in CI? The latter simplifies the packaging problem, since the environment is hand-coded into the CI workflow, but I don’t know how reliable CI runners are for gauging performance; I’ve always relied on local tests.
- I typically run Julia vs. Python benchmarks and can leverage PyCall (a rough sketch of the pattern I use is after this list), but what about C++ code, or R code? I’m aware of RCall; are people confident using it for benchmarking?
- Every time I use BenchmarkTools I feel like I’m not taking full advantage of it. Are there any idiomatic examples of using the `BenchmarkSuite`? (I’ve sketched what I think the idiom is after this list.)
- How does/should writing benchmarks change when the goal is comparing against other software as opposed to testing for regressions? Ideally both would be possible with the same tool. Are there any examples of the CI required to do this? Something I could see being typical: every minor release, say, gets a regression test, which additionally creates comparison benchmarks against the other software. If a particular PR is performance-related, a review comment could trigger a test, too.
- Tied into (5), how do you organize your benchmarks? Do you use modules? Are they within the package or only accessible by cloning the repository (e.g. a `benchmarks/` folder)?
- I feel like it makes sense to collect all the benchmarks into CSV files and check them into git, then use `@example` blocks in the documentation to plot and display them (sketch of this after the list). How does this compare to creating the plots within the benchmark itself?
- Something that could also be cool is creating benchmarks for numerical error (e.g. comparing against `BigFloat`; see the last sketch below), especially if this can be easily incorporated into the rest of the benchmarking tools.
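To make some of the questions above concrete, here’s roughly what I mean by the `BenchmarkSuite` idiom: a `benchmarks/benchmarks.jl` file that builds a nested `BenchmarkGroup` called `SUITE`, which (as far as I understand) is also the layout PkgBenchmark.jl looks for. The function names (`mysolve`, `myinterp`) and `MyPackage` are placeholders, not a real API.

```julia
# benchmarks/benchmarks.jl — placeholder functions, just to show the structure
using BenchmarkTools
using MyPackage   # hypothetical package under test

const SUITE = BenchmarkGroup()

# Nested groups keep benchmarks organized by topic and problem size
SUITE["solve"] = BenchmarkGroup(["linear"])
for n in (10, 100, 1_000)
    A, b = rand(n, n), rand(n)
    SUITE["solve"]["dense", n] = @benchmarkable mysolve($A, $b)
end

SUITE["interp"] = BenchmarkGroup()
SUITE["interp"]["uniform"] = @benchmarkable myinterp($(rand(100)))
```

Locally I’d run this with `tune!(SUITE); results = run(SUITE)`; my understanding is PkgBenchmark.jl can pick up the same file on CI.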
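For the Julia-vs-Python comparisons, the pattern I use is just PyCall plus BenchmarkTools; NumPy’s `sort` here is only a stand-in workload. Note that the Python timing includes PyCall’s conversion/call overhead, which may or may not be what you want to measure.

```julia
using BenchmarkTools, PyCall

np = pyimport("numpy")
x = rand(10^6)

jl = @benchmark sort($x)
py = @benchmark $(np.sort)($x)   # includes PyCall conversion/call overhead

println("Julia sort:  ", minimum(jl))
println("NumPy sort:  ", minimum(py))
```

Wrapping the array once with `PyObject(x)` outside the benchmark keeps the data on the Python side, if you’d rather exclude the conversion cost.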
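And here’s the CSV idea: flatten the suite’s results into a table, commit it, and let the docs plot it. `leaves` comes from BenchmarkTools; the file layout (`benchmarks/run.jl`, `results.csv`) is just my guess at a convention.

```julia
# benchmarks/run.jl — run the suite and flatten results to a committed CSV
using BenchmarkTools, CSV, DataFrames

include("benchmarks.jl")        # defines SUITE as in the sketch above
tune!(SUITE)
results = run(SUITE; verbose = true)

# One row per leaf benchmark, keyed by its path in the nested groups
rows = [(benchmark = join(key, "/"), min_time_ns = minimum(trial).time)
        for (key, trial) in leaves(results)]
CSV.write(joinpath(@__DIR__, "results.csv"), DataFrame(rows))
```

A Documenter `@example` block in the docs can then `CSV.read` that file and `plot` it, so the figures regenerate with the docs rather than being produced during the benchmark run.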
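For the numerical-error idea, something like the following is what I have in mind, where `myfunc` is a placeholder for whatever the package computes; the error summary could feed into the same CSV/report as the timings.

```julia
# Accuracy "benchmark": compare a Float64 code path against a BigFloat reference
using Statistics

setprecision(BigFloat, 256) do
    xs = range(0.1, 10.0; length = 1_000)
    approx = myfunc.(xs)              # Float64 implementation under test (placeholder)
    exact  = myfunc.(BigFloat.(xs))   # same code path evaluated at 256-bit precision
    relerr = abs.((approx .- exact) ./ exact)
    println("max relative error:    ", Float64(maximum(relerr)))
    println("median relative error: ", Float64(median(relerr)))
end
```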
Packages/Tools
Here’s a list of packages/tools related to benchmarking; I’ll update it as people mention more.
- BenchmarkTools.jl
- BenchmarkHelper.jl (I haven’t used this before; what do people think?)
- PkgBenchmark.jl (@giordano)
- BenchmarkCI.jl (@giordano)