Easy GitHub benchmarking with new AirspeedVelocity.jl

AirspeedVelocity.jl now has a marketplace GitHub Action!

I’m excited to announce a marketplace GitHub Action for AirspeedVelocity.jl, which makes it very easy to measure benchmarks in pull requests to your Julia package. It reports the time AND memory changes, for all defined benchmarks, relative to your default branch. It will even track startup time for you.

Quickstart

You need to follow the standard BenchmarkTools.jl layout: create a file benchmark/benchmarks.jl (with an optional benchmark/Project.toml) that defines a BenchmarkGroup named SUITE:

using MyPackage: my_eval
using BenchmarkTools

const SUITE = BenchmarkGroup()
SUITE["my_eval"] = @benchmarkable my_eval(x) setup=(x=randn(100))

If you have done this, all you need to do now is add this workflow file to .github/workflows/benchmark.yml:

name: Benchmark this PR
on:
  pull_request_target:
    branches: [ master ]  # or your default branch
permissions:
  pull-requests: write    # needed to post comments

jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: MilesCranmer/AirspeedVelocity.jl@action-v1
        with:
          julia-version: '1.10'

That’s it! Now every PR will include clear, collapsible benchmark reports directly in a GitHub comment:

You can click to expand these, which gives you detailed before-and-after comparisons for the PR:

The benchmarks also automatically include time_to_load, which is measured by restarting Julia several times:

You can see that the memory benchmarks include both allocations and bytes.

You can also benchmark over multiple Julia versions! Just use the usual strategy: matrix: ... approach, as sketched below. Each version will show up as a separate comment in the thread:
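For instance, a minimal sketch of such a matrix (the rest of the workflow stays as in the quickstart; the version list here is just an example):

jobs:
  bench:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        julia-version: ['1.9', '1.10']  # example versions; adjust as needed
    steps:
      - uses: MilesCranmer/AirspeedVelocity.jl@action-v1
        with:
          julia-version: ${{ matrix.julia-version }}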

(Edit: I’ve also now added a “job-summary” mode, which writes the benchmark results to the action’s summary page rather than to GitHub comments)

Check out the full documentation here.

Happy benchmarking!

57 Likes

Did you consider using Job Summaries instead of messages to the pull request? They would not clutter the discussion in PRs.

8 Likes

Thanks, that looks cool, I hadn’t seen it before.

The current action uses peter-evans/create-or-update-comment, which is good at updating existing comments, so there should only be one comment per PR (and no notifications after the first). I guess one downside is that benchmarking multiple versions means multiple comments, but I haven’t worried about this much yet.

2 Likes

This looks cool! Do you think this could offer longitudinal records of package performance? I’m thinking of something like the page codecov provides, where you can see the change commit-to-commit.

I’m also interested to hear how you work around the variance in GitHub runner performance? Do you run some sort of representative/calibration workload at the start and hope that the other tasks being juggled on the VM don’t change much while the benchmarked tasks are running?

3 Likes

Within a single job, I have found the GitHub runner performance to be decently consistent (actually more consistent than my laptop, with all of its apps running).

There is, however, a decent amount of variability across jobs, presumably because they run on different machines.

So the benchmark just does the comparison within a single job, and this is what gets printed. There is no long-term tracking of performance statistics.

You can, however, do this locally. The --rev option lets you pass any number of commits, and the resulting benchmarks are saved to files, so you could then plot performance over time as desired.
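For example, from your package’s root directory, something like this (a sketch; the revision names are placeholders):

benchpkg --rev=v1.0.0,v1.1.0,v1.2.0,master

The saved results can then be passed to benchpkgplot, or loaded by your own plotting code, to chart performance across those revisions.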

Very nice effort! As some more inspiration: for Makie, we’ve written some code that plots similar metrics, and it posts the summary plots to a gist because, sadly, you cannot programmatically add images to comments or to workflow summaries. It’s nicer than the table we had before, because visual outlier detection is much better than relying on summary stats, I think, and the noise can be considerable. Here’s a link to a random PR’s run:

Looks like:

7 Likes

Thanks! This is a brilliant idea. Perhaps we can auto-generate plots like these for the benchpkgplot command in AirspeedVelocity? If you’re interested and have some bandwidth, I’d love to have your help incorporating that idea!

At the moment, benchpkgplot (enabled in the GitHub action with enable-plots: 'true') just generates simple error bar plots like this:


and stores them in the build artifacts. But now I am embarrassed, because it literally has access to the full table of times, and yet I never thought to plot the full distribution! :person_facepalming:

So hopefully it shouldn’t be much effort to get this working.

I’d be happy to add other backends like Makie btw. PlotlyLight was chosen before extensions were a thing, in an effort to minimize build times, but now we can add or switch to other options.
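For anyone who wants to try the current plots, enabling them in the quickstart workflow looks roughly like this (a sketch; the enable-plots input is the one quoted above, and everything else matches the quickstart step):

      - uses: MilesCranmer/AirspeedVelocity.jl@action-v1
        with:
          julia-version: '1.10'
          enable-plots: 'true'  # plots are uploaded as build artifacts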

3 Likes

Hey,
great package!

Is there a way to add extra packages when using MilesCranmer/AirspeedVelocity.jl@action-v1?

I would like to add extra packages with

`-a, --add <arg>`: Extra packages needed (delimit by comma).

but I’m unsure how to do this with the new action.

1 Like

Okay, I just found it. :wink:

Another question:
In some cases I extend the number of parameters that I pass to the function I want to benchmark.

However, the bench-on parameter makes this difficult to achieve, right? I’d like to benchmark my old code with the old benchmark script and my new code with the new benchmark script. Is that possible?

2 Likes

I normally just declare parameters in the benchmark name, like:

# Normal
SUITE["f"] = @benchmarkable ...

# New code
SUITE["f"] = BenchmarkGroup()
SUITE["f"]["0"] = @benchmarkable ...  # works on both old and new
if isdefined(MyPkg, :new_method)
    for alpha in (0.1, 0.2)
        SUITE["f"][string(alpha)] = @benchmarkable ...  # only works on new
    end
end

It’s not an issue if there are new benchmarks on one version but not the other: all benchmarks will show up in the table, and the ratio column is simply left empty wherever the before or after measurement is missing.

Also note that there is a PACKAGE_VERSION constant available in the namespace of the benchmark script. AirspeedVelocity.jl defines this so that you can write version-specific benchmark code for different subsets of your version history.

if PACKAGE_VERSION < v"1.0.0"
    SUITE["f"] = ...
elseif PACKAGE_VERSION < v"2.0.0"
    SUITE["f"]["0"] = ...
...
end

I recommend doing this instead of having separate scripts.

This makes complete sense. I do wonder to what extent a few small “calibration tasks” could be run to create a “normalised performance” metric that is meaningful across different GitHub CI runs? :thinking:

Just a showerthought.

1 Like

It’s not a bad idea, but I’m not sure; it just seems like a lot of work, because different runners might be faster or slower in different ways, rather than differing by a single factor that can be scaled away linearly.

In a way, the “before” benchmark already kind of does this: it is a task-specific calibration. So the ratio column is probably where the signal-to-noise is highest.

I had the same thought. It’s a little tricky; I’m just very keen for longitudinal performance data that’s built up over time.

Ah yes, that seems good! With my “longitudinal” thinking cap still on, it occurs to me that one could even pick a small set of previous commits and benchmark each of them (I have a feeling that an exponentially distributed selection might be best, e.g. the last 2 .^ (0:n) commits).

I think this would have multiple advantages: it wouldn’t just give more reference points, but (if run via CI on each push, for instance) would also mean that we have multiple measurements for multiple commits. That would improve accuracy while letting us gauge uncertainty, allowing us to do hypothesis testing of whether a single PR/commit actually performs better or worse, and it would also inhibit drift in the “calibration”.

For this purpose I think it is both easier and more accurate to run locally on a dedicated machine. You can do it with the benchpkg command:

benchpkg --rev=v0.2.0,v0.2.1,v0.2.2,v0.3.0,v0.3.1,master

One reason not to use “old” benchmark runs aggregated across GitHub Actions is that dependencies may change, which could influence timings. Running all the benchmarks at once means you are only sensitive to changes in your package, rather than to changes in the rest of the ecosystem.

Well sure, but I’m trying to trick someone (you? myself?) into making a nice pretty plot that “just happens” in the background :face_with_tongue:

Ah yes, that’s another good point.

Hmm, I still like the idea, but it’s clear at this point that it’s more involved than I previously thought.

Quick update: per request, I have now added job summaries as an opt-in mode for the CI action.

name: Benchmark this PR
on:
  pull_request:        # <-- no need for pull_request_target
    branches: [ master ]

# <-- no permissions needed

jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: MilesCranmer/AirspeedVelocity.jl@action-v1
        with:
          julia-version: '1'
          job-summary: 'true'  # <-- new option

It is slightly out of the way (click on the benchmark run, then go to the “summary” tab and scroll down), but if you want to run a large matrix of benchmarks across many versions and configuration parameters, I think it is significantly cleaner than having individual comments in the PR.

6 Likes