Easy GitHub benchmarking with new AirspeedVelocity.jl

AirspeedVelocity.jl now has a marketplace GitHub Action!

I’m excited to announce a marketplace GitHub Action for AirspeedVelocity.jl, which makes it very easy to measure benchmarks in pull requests to your Julia package. It reports the time AND memory changes, for all defined benchmarks, relative to your default branch. It will even track startup time for you.

Quickstart

You need to follow the standard BenchmarkTools.jl layout: create a file benchmark/benchmarks.jl (with an optional benchmark/Project.toml) that defines a BenchmarkGroup named SUITE:

using MyPackage: my_eval
using BenchmarkTools

const SUITE = BenchmarkGroup()
SUITE["my_eval"] = @benchmarkable my_eval(x) setup=(x=randn(100))

If you have done this, all you need to do now is add this workflow file to .github/workflows/benchmark.yml:

name: Benchmark this PR
on:
  pull_request_target:
    branches: [ master ]  # or your default branch
permissions:
  pull-requests: write    # needed to post comments

jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: MilesCranmer/AirspeedVelocity.jl@action-v1
        with:
          julia-version: '1.10'

That’s it! Now every PR will include clear, collapsible benchmark reports directly in a GitHub comment:

You can click to expand these, which gives you detailed before-and-after comparisons for the PR:

The benchmarks also automatically include time_to_load, which is measured by restarting Julia several times:

You can see that the memory benchmarks include both allocations and bytes.

You can also benchmark over multiple Julia versions! Just use the usual strategy: matrix: ... approach, as sketched below. Each version will show up as a separate comment in the thread:
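For instance, a minimal sketch of such a matrix (the rest of the workflow stays as in the quickstart; the version list here is just an example):

jobs:
  bench:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        julia-version: ['1.9', '1.10']  # example versions; adjust as needed
    steps:
      - uses: MilesCranmer/AirspeedVelocity.jl@action-v1
        with:
          julia-version: ${{ matrix.julia-version }}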

(Edit: I’ve also now added a “job-summary” mode, which writes the benchmark results to the action’s summary page rather than to GitHub comments)

Check out the full documentation here.

Happy benchmarking!

57 Likes

Did you consider using Job Summaries instead of messages to the pull request? They would not clutter the discussion in PRs.

8 Likes

Thanks, that looks cool, I hadn’t seen it before.

The current action uses peter-evans/create-or-update-comment, which is good at updating existing comments, so there should only be one comment per PR (and no notifications after the first). I guess one downside is that benchmarking multiple versions means multiple comments, but I haven’t worried about this much yet.

2 Likes

This looks cool! Do you think this could offer longitudinal records of package performance? I’m thinking of something like the page codecov provides, where you can see the change commit-to-commit.

I’m also interested to hear how you work around the variance in GitHub runner performance? Do you run some sort of representative/calibration workload at the start and hope that the other tasks being juggled on the VM don’t change much while the benchmarked tasks are running?

3 Likes

Within a single job, I have found the GitHub runner performance to be decently consistent (actually more consistent than my laptop, with all of its apps running).

There is, however, a decent amount of variability across jobs, presumably because they run on different machines.

So the benchmark just does the comparison within a single job, and this is what gets printed. There is no long-term tracking of performance statistics.

You can, however, do this locally. The --rev option lets you pass any number of commits, and the resulting benchmarks are saved to files, so you could then plot performance over time as desired.
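For example, from your package’s root directory, something like this (a sketch; the revision names are placeholders):

benchpkg --rev=v1.0.0,v1.1.0,v1.2.0,master

The saved results can then be passed to benchpkgplot, or loaded by your own plotting code, to chart performance across those revisions.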

Very nice effort! As some more inspiration: for Makie, we’ve written some code that plots similar metrics, and it posts the summary plots to a gist because, sadly, you cannot programmatically add images to comments or to workflow summaries. It’s nicer than the table we had before, because visual outlier detection is much better than relying on summary stats, I think, and the noise can be considerable. Here’s a link to a random PR’s run:

Looks like:

7 Likes

Thanks! This is a brilliant idea. Perhaps we can auto-generate plots like these for the benchpkgplot command in AirspeedVelocity? If you’re interested and have some bandwidth, I’d love to have your help incorporating that idea!

At the moment, benchpkgplot (enabled in the GitHub action with enable-plots: 'true') just generates simple error bar plots like this:


and stores them in the build artifacts. But now I am embarrassed, because it literally has access to the full table of times, and yet I never thought to plot the full distribution! :person_facepalming:

So hopefully it shouldn’t be much effort to get this working.

I’d be happy to add other backends like Makie btw. PlotlyLight was chosen before extensions were a thing, in an effort to minimize build times, but now we can add or switch to other options.
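For anyone who wants to try the current plots, enabling them in the quickstart workflow looks roughly like this (a sketch; the enable-plots input is the one quoted above, and everything else matches the quickstart step):

      - uses: MilesCranmer/AirspeedVelocity.jl@action-v1
        with:
          julia-version: '1.10'
          enable-plots: 'true'  # plots are uploaded as build artifacts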

3 Likes

Hey,
great package!

Is there a way to add extra packages when using MilesCranmer/AirspeedVelocity.jl@action-v1?

I would like to add extra packages with

`-a, --add <arg>`: Extra packages needed (delimit by comma).

but I’m unsure how to do this with the new action.

1 Like

Okay, I just found it. :wink:

Another question:
In some cases I extend the number of parameters that I pass to the function I want to benchmark.

However, the bench-on parameter makes this difficult to achieve, right? I’d like to benchmark my old code with the old benchmark script and my new code with the new benchmark script. Is that possible?

2 Likes

I normally just declare parameters in the benchmark name, like:

# Normal
SUITE["f"] = @benchmarkable ...

# New code
SUITE["f"] = BenchmarkGroup()
SUITE["f"]["0"] = @benchmarkable ...  # works on both old and new
if isdefined(MyPkg, :new_method)
    for alpha in (0.1, 0.2)
        SUITE["f"][string(alpha)] = @benchmarkable ...  # only works on new
    end
end

It’s not an issue if there are new benchmarks on one version but not the other: all benchmarks will show up in the table, and the ratio column is simply left empty wherever the before or after measurement is missing.

Also note that there is a PACKAGE_VERSION constant available in the namespace of the benchmark script. AirspeedVelocity.jl defines this so that you can write version-specific benchmark code for different subsets of your version history.

if PACKAGE_VERSION < v"1.0.0"
    SUITE["f"] = ...
elseif PACKAGE_VERSION < v"2.0.0"
    SUITE["f"]["0"] = ...
...
end

I recommend doing this instead of having separate scripts.

This makes complete sense. I do wonder to what extent a few small “calibration tasks” could be run to create a “normalised performance” metric that is meaningful across different GitHub CI runs? :thinking:

Just a showerthought.

1 Like

It’s not a bad idea, but I’m not sure; it just seems like a lot of work, because different runners might be faster or slower in different ways, rather than differing by a single factor that can be scaled away linearly.

In a way, the “before” benchmark already kind of does this: it is a task-specific calibration. So the ratio column is probably where the signal-to-noise is highest.

I had the same thought. It’s a little tricky; I’m just very keen for longitudinal performance data that’s built up over time.

Ah yes, that seems good! With my “longitudinal” thinking cap still on, it occurs to me that one could even pick a small set of previous commits and benchmark each of them (I have a feeling that an exponentially distributed selection might be best, e.g. the last 2 .^ (0:n) commits).

I think this would have multiple advantages: it wouldn’t just give more reference points, but (if run via CI on each push, for instance) would also mean that we have multiple measurements for multiple commits. That would improve accuracy while letting us gauge uncertainty, allowing us to do hypothesis testing of whether a single PR/commit actually performs better or worse, and it would also inhibit drift in the “calibration”.

For this purpose I think it is both easier and more accurate to run locally on a dedicated machine. You can do it with the benchpkg command:

benchpkg --rev=v0.2.0,v0.2.1,v0.2.2,v0.3.0,v0.3.1,master

One reason not to use “old” benchmark runs aggregated across GitHub Actions is that dependencies may change, which could influence timings. Running all the benchmarks at once means you are only sensitive to changes in your package, rather than to changes in the rest of the ecosystem.

Well sure, but I’m trying to trick someone (you? myself?) into making a nice pretty plot that “just happens” in the background :face_with_tongue:

Ah yes, that’s another good point.

Hmm, I still like the idea, but it’s clear at this point that it’s more involved than I previously thought.

Quick update: per request, I have now added job summaries as an opt-in mode for the CI action.

name: Benchmark this PR
on:
  pull_request:        # <-- no need for pull_request_target
    branches: [ master ]

# <-- no permissions needed

jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: MilesCranmer/AirspeedVelocity.jl@action-v1
        with:
          julia-version: '1'
          job-summary: 'true'  # <-- new option

It is slightly out of the way (click on the benchmark run, then go to the “summary” tab and scroll down), but if you want to run a large matrix of benchmarks across many versions and configuration parameters, I think it is significantly cleaner than having individual comments in the PR.

6 Likes