How to run benchmark CI on every commit?

I’m working on a package that I want to benchmark at every commit, so that I can see how its performance evolves over time.

A short note on this idea

I assume this is not the best workflow and that it would be better to run the benchmarks at every PR. However, the package is still at an early stage (not yet ready for publication) and I’m working on it alone, so running the benchmarks commit by commit seems fine for now.

My expectations

I’d like to run the same benchmark for every commit. This doesn’t have to happen in “real time” while I’m coding, but I want the measurements to be consistent. I want to store the results somewhere (in a git repo, in a csv file or something like that). I want to do it for free.

The problem

AFAIK I can’t use GitHub’s and GitLab’s CI, because the jobs can run on any machine (which kills the goal of consistency).

My concept

I have access to a Linux VM running in our institution’s cloud (I’m waiting for confirmation that the hardware is not changing). I would write a script that clones the package and runs the benchmark. It’s ok if I have to start it manually (e.g. at the end of the day), but it should traverse all the commits (on #master) that haven’t been benchmarked yet. Then it should commit and push the results to a benchmark repo (generate markdown/html based on the csv and publish on gh/gl pages, etc.).

(Another solution I thought of is installing a GitLab server for myself, but that doesn’t sound like an easy way to do this.)

My questions

  1. What do you think? Is it reasonable? Any other way to do this?
  2. How to traverse through commits? (That is my main question.)

I’m open to any help, advice or opinion!


It’s not only about the hardware but also about other users running things on the same machine while you are running the benchmarks.

For generating the markdown you could use


It’s actually straightforward to set up a GitLab CI runner (not the entire GitLab service) on machines you control. It’s easy if you can use Docker. Once you set up and register the GitLab runner service, you can trigger the job by pushing. But this requires the runner service to keep running in the background, so I’m not sure if it fits your needs.

FYI, I set up benchmarks on Travis CI and run a couple of benchmarks there. I was worried that the results might fluctuate a lot, but it turned out that Travis is consistent enough for my needs. This is because I only care about the performance relative to the baseline implementation I write.


Great to see you got PkgBenchmark working with Travis CI! From a quick look over your CI benchmark scripts in Transducers.jl, TransducersBenchmarksReports.jl, and Run.jl, it doesn’t seem trivial to set up, though. I’d really appreciate a couple of minimal examples for this.

Related to this, I never got a response here: PkgBenchmark.jl workflow

Thanks! I’ll keep that in mind. (As far as I know and understand, our VMs are bound to x number of cores and y RAM so this should not be an issue.)

I’ll definitely check it out!

Hmm, so that’s how DiffEqBot works (I guess). Keeping it running should not be an issue; I’ll read through the docs to see how that works.

I’ll check the code, thank you!

I’m not doing anything complicated, actually. I just use benchmarkpkg to run the benchmark, use readresults to load it, use export_markdown to create a markdown file, and then use Documenter to generate the GitHub page. I’m using Run.script to set up a project automatically before running the script, but you can replace it with Pkg.instantiate and Pkg.activate. Note that the CI for Transducers.jl is not related (even though I run the benchmarks there as a smoke test).
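The three PkgBenchmark calls named above can be chained roughly as follows. This is only a minimal sketch, not the actual script from the repositories mentioned in this thread: the wrapper function write_report, the package name, and both file names are hypothetical, and the Documenter step is left out.

```julia
using PkgBenchmark

# Hypothetical helper: run the benchmarks of `pkg`, save the raw results,
# and export them as a markdown report. All names and paths are placeholders.
function write_report(pkg::AbstractString;
                      resultfile::AbstractString = "results.json",
                      reportfile::AbstractString = "benchmark.md")
    # Runs benchmark/benchmarks.jl of the package and stores the results on disk.
    benchmarkpkg(pkg; resultfile = resultfile)

    # Load the stored results again (this could also happen in a later CI step).
    results = readresults(resultfile)

    # Write a markdown file that a documentation build can include as a page.
    export_markdown(reportfile, results)
    return reportfile
end
```

Calling write_report("MyPackage") would then leave a results file and a markdown report in the current directory, assuming MyPackage (a placeholder name) is available in the active environment and has a benchmark/benchmarks.jl file.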

(A slightly more sophisticated approach would be to use judge w.r.t. a reference revision, say the last release. But I haven’t gotten there yet.)

I’ve been playing around a bit with PkgBenchmark.judge, and I highly recommend it. It really makes it easy to, for example, compare the performance of a pull request with the performance of master.
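As a rough illustration of that kind of comparison, here is a minimal sketch. The wrapper function compare_revisions, the package name, the branch names, and the report file name are assumptions made for the example, and the sketch assumes the package repository has no uncommitted changes when it runs.

```julia
using PkgBenchmark

# Hypothetical helper: benchmark `target` and `baseline` (branch names, tags,
# or commit ids) of `pkg` and write the comparison to a markdown file.
function compare_revisions(pkg::AbstractString, target::AbstractString,
                           baseline::AbstractString;
                           reportfile::AbstractString = "judgement.md")
    # judge runs the benchmarks for both revisions and compares the results.
    judgement = judge(pkg, target, baseline)

    # export_markdown also accepts the judgement object returned by judge.
    export_markdown(reportfile, judgement)
    return judgement
end
```

For example, compare_revisions("MyPackage", "my-feature-branch", "master") would compare a feature branch against master; both the package and the branch name are placeholders.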

Yea, PkgBenchmark.judge is great for deved local repositories. But IIRC it doesn’t work well with Manifest.toml. I think I know how to set it up. It’s just that I don’t need it at the moment.

But I do use judge for comparing with a manual implementation (within the same commits). This part is a bit tricky and requires touching PkgBenchmark internals a bit.


We use the same setup and it’s really easy to implement. The nice thing is: you can register a runner on a dedicated machine where you explicitly limit the number of parallel instances to one to avoid resource clashes.


If you don’t mind using a non-Julia tool, the following stackexchange suggestion might be worth exploring. I’m sure there are similar libraries/applications in other languages… this example uses Ruby:

In the end, I wrote a script that goes through all the commits and benchmarks the code with PkgBenchmark. That’s an awesome package, thank you for it! (The docs are outdated, though; I addressed that in a PR.)
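For readers looking for a starting point, here is a minimal sketch of how such a traversal could look in Julia. It is not the script mentioned above: the function names, the CSV layout (a header line followed by commit,value rows), and the benchmark callback are all assumptions made for illustration. It lists the commits of a branch oldest first with git rev-list --reverse, skips those already recorded in the CSV file, and appends one row per newly benchmarked commit.

```julia
# List all commit hashes on `branch`, oldest first, by shelling out to git.
function list_commits(repo::AbstractString; branch::AbstractString = "master")
    out = read(`git -C $repo rev-list --reverse $branch`, String)
    return String.(split(out))
end

# Collect the hashes already present in the first column of the CSV file.
# A missing file means that nothing has been benchmarked yet.
function benchmarked_commits(csvfile::AbstractString)
    shas = Set{String}()
    isfile(csvfile) || return shas
    for line in Iterators.drop(eachline(csvfile), 1)  # skip the header line
        isempty(line) && continue
        push!(shas, String(first(split(line, ','))))
    end
    return shas
end

# Keep the original (oldest first) order and drop what is already done.
pending_commits(all_shas, done) = [sha for sha in all_shas if !(sha in done)]

# Call `benchmark(sha)` for every pending commit and append one CSV row each.
# `benchmark` is a user-supplied function returning a single number.
function run_pending(benchmark, repo::AbstractString, csvfile::AbstractString;
                     branch::AbstractString = "master")
    todo = pending_commits(list_commits(repo; branch = branch),
                           benchmarked_commits(csvfile))
    isfile(csvfile) || write(csvfile, "commit,value\n")
    for sha in todo
        value = benchmark(sha)
        open(csvfile, "a") do io
            println(io, sha, ',', value)
        end
    end
    return todo
end
```

A call could look like run_pending(sha -> measure(sha), "path/to/MyPackage", "results.csv"), where measure is a hypothetical function of your own. Since benchmarkpkg accepts a git identifier as its target, measure could call benchmarkpkg on the commit hash and reduce the result to one number. Writing each row right after its benchmark finishes means an interrupted run can be resumed later without repeating work.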

Thank you for the confirmation, I may play with runners in the near future.

I wanted to go with Julia, because that’s the language I’m most confident with. But thanks for the idea of non-Julia scripts; next time I’ll consider a broader set of ideas.