Benchmarking with PkgBenchmark.jl



Can you please share your workflow with PkgBenchmark.jl?

From what I understood in the README, we are supposed to manually run the benchmarks locally and compare the results between two commits/branches. I wonder how you benchmark the master branch against the latest tagged version of your package, and how you automate this process.
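For the commit-vs-commit part, a minimal sketch based on the judge usage shown later in this thread might look like the following (the package name and tag are placeholders, and this assumes judge(pkg, baseline) compares the current checkout against the given git reference):

```julia
using PkgBenchmark

# Compare the current state of a (hypothetical) package against its
# latest tagged release; "v0.2.0" is a placeholder tag name.
results = judge("MyPackage", "v0.2.0")
showall(results)
```

Automating this on CI is exactly the open question here, since someone has to decide which baseline results to store and where.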


I tried using the package, but given all the issues I had, I decided to eliminate PkgBenchmark.jl from my project.

Is there any well-maintained alternative for performance regression tests in Julia?


I don’t know of any, but would like to see this kind of stuff as part of Base’s testing functionality. Having this specifically be standardized across packages would be very useful given how performance-obsessed Julia is.


I fully agree @ChrisRackauckas, we need this to be part of Base.Test.


Benchmarking infrastructure is quite large and complicated and not going to be added to Base. It may be added as a distributed-by-default package at some point once we have infrastructure for that. BenchmarkTools would be a good addition to JuliaPro as well. But the trend is towards taking code out of Base for 1.0 so the core language can be stable but external libraries can still be developed more flexibly. BenchmarkTools is widely used and well-maintained enough that it’s effectively standardized. If you’re unhappy with how well maintained PkgBenchmark is, either contribute to the package or to a replacement. Better tooling for package developers is important, but not something that the people working on the core language can dedicate a lot of attention to before 1.0.


Thank you @tkelman, I agree with what you said. It doesn’t necessarily need to be in Base, but we definitely need a standardized documented process for regression tests in Julia.


I am also a bit unhappy with the current state of package benchmarking tools. I’ll try to get some work done on it in the coming weeks, either as a new package or as PRs to the existing one.


Perhaps this functionality could go into PkgDev at some future point? It does seem like having some quasi-standard package benchmarking system would be useful.


BenchmarkTools.jl is great and serves its purpose perfectly, but what seems to be missing is Nanosoldier in packages.

Yes, that exists, but it would be nice if packages could easily get set up with something like that. You can’t expect every contributor to know how to look for performance regressions, so I think this would be essential for maintaining Julia packages. That said, I have no idea how Nanosoldier actually works or how it affects CI times (does it also run on Travis?), so this is just a suggestion without an actual plan. I would just like it to be easy for me or anyone else to check timings before/with PRs, since just making sure a package works isn’t enough for packages where performance matters. (Yes, right now you can do things like @elapsed and check the timing, but that gets crazy because you have to account for different computers, like Travis being really slow, and tuning those kinds of tests never seemed to work.)


That’s what we are talking about here, e.g. PkgBenchmark.jl.


But does anyone have that setup? I’ve never actually seen it in use, and I’ve never been able to find out how to use it myself.


I’m pretty sure that Nanosoldier is a dedicated benchmarking server (and the corresponding package is specific to that server, and the Base benchmarks). Ideally we could use PkgBenchmark to create a similar setup anywhere, but that doesn’t seem to be possible at the moment.


I have been working with PkgBenchmark.jl all day so far and I have to say I am very happy with it. I went into it expecting lots of issues (because of the general vibe of this thread), but so far everything works smoothly and using it is also very simple. In fact, I did not expect comparing benchmarks between different commits to be that straightforward.

I did, however, encounter a cognitive barrier with how BenchmarkTools.jl itself works (which PkgBenchmark.jl uses as a backend).

I had a completely wrong intuition of how “tuning” works. In my mind, I thought it tries to figure out how many samples to take (after “warming up” the function) and saves that number to have consistency between different runs. I am unsure where I got that idea, but for some reason I didn’t question it until today. It turns out tuning just takes care of estimating the evals/sample (which, to be fair, makes more sense).

Anyway, this misunderstanding caused me a bit of a headache because I struggled to get reasonable benchmark results for my problem. The reason for this was that the function I am interested in benchmarking takes around 6 seconds on first call and around 170 microseconds each subsequent call. The default and constant time budget for individual benchmarks is 5 seconds, which means that if I don’t invoke my function at least once before benchmarking, I only get one sample (the 6 seconds one).

Now the funny thing is that if I don’t have a “tune” file (which in this case I don’t even need), PkgBenchmark will create one, effectively invoking my function once before actually benchmarking it. Thus on the first commit, where I called it, I got reasonable results. If I then benchmarked on a different commit (with the now existing tune file), I’d only get one sample with a 6-second runtime.

Anyway, long story short: make sure to manually set the seconds parameter for your problem if what you are interested in aren’t micro benchmarks. Then everything works nicely.
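To illustrate with plain BenchmarkTools (the function heavy here is a made-up stand-in for an expensive call, not from the package above):

```julia
using BenchmarkTools

# The default per-benchmark time budget is 5 seconds, which is what
# swallowed the 6-second first call described above.
@show BenchmarkTools.DEFAULT_PARAMETERS.seconds

# Raising `seconds` leaves room for many samples even when the first
# call is expensive. Note that tune! only estimates evals per sample;
# it does not fix the number of samples.
heavy(x) = sum(sin, x)
b = @benchmarkable heavy(x) setup=(x = rand(10^4)) seconds=20 samples=100
tune!(b)
t = run(b)
println(minimum(t))
```

The same seconds and samples keywords can be passed through PkgBenchmark, as shown below.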

Another tip, which isn’t really visible in the PkgBenchmark README but is apparent from looking at the code for @bench, is that you can pass parameters such as setup, teardown, seconds, etc. to @bench as well.

Here is some sample code from my benchmark/benchmarks.jl. You won’t be able to execute it because the package isn’t public, but it shows how I use the package right now. It’s surely not the final version, but it works:

using WaveSimulator
using PkgBenchmark

@benchgroup "simulation" ["simulate", "simulate!"] begin
    for (resource, tags) in ((CPU1(), ["CPU", "CPU1"]),
                             (CPUThreads((100,1,1)), ["CPU", "CPUThreads"]),
                             (CUDALibs(), ["GPU"]))
        @benchgroup "$(typeof(resource))" tags begin
            @bench("simulate!",
                simulate!(state, backend, sim),
                setup = begin
                    wave = UniformWave{3}(fmax=2e3)
                    sim = Simulator(wave, resource=$(resource), duration=0.01)
                    domain = BoxDomain(6,8,4, gamma=0.05)
                    f0 = WaveSimulator.gauss(domain)
                    backend = WaveSimulator.backend_init(sim.resource, domain, sim)
                    state   = WaveSimulator.state_init(f0, backend, domain, sim)
                end,
                teardown = begin
                    backend = nothing
                    state = nothing
                end,
                seconds = 20,
                samples = 100)
        end
    end
end
To store a result for your commit on the current machine, just call:

julia> using PkgBenchmark

julia> res = benchmarkpkg("WaveSimulator"); showall(res)
INFO: Running benchmarks...
Creating benchmark tuning file /home/csto/.julia/v0.6/.benchmarks/WaveSimulator/.tune.jld
File results of this run? (commit=c0be5c, resultsdir=/home/csto/.julia/v0.6/.benchmarks/WaveSimulator/results) (Y/n) y
INFO: Results of the benchmark were written to /home/csto/.julia/v0.6/.benchmarks/WaveSimulator/results/c0be5c6045d034316011623cb395ffccb18b8a08.jld
1-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "simulation" => 3-element BenchmarkTools.BenchmarkGroup:
          tags: ["simulate", "simulate!"]
          "CUDALibs" => 1-element BenchmarkTools.BenchmarkGroup:
                  tags: ["GPU"]
                  "simulate!" => Trial(168.666 μs)
          "CPUThreads" => 1-element BenchmarkTools.BenchmarkGroup:
                  tags: ["CPU", "CPUThreads"]
                  "simulate!" => Trial(38.190 ms)
          "CPU1" => 1-element BenchmarkTools.BenchmarkGroup:
                  tags: ["CPU", "CPU1"]
                  "simulate!" => Trial(180.768 ms)

Now make changes to your package. It is then quite simple to compare the current state with some given commit:

julia> using PkgBenchmark

julia> cmp = judge("WaveSimulator", "c0be5c6")
INFO: Running benchmarks...
Using benchmark tuning data in /home/csto/.julia/v0.6/.benchmarks/WaveSimulator/.tune.jld
WARNING: /home/csto/.julia/v0.6/WaveSimulator is dirty, not attempting to file results...
INFO: Reading results for c0be5c from /home/csto/.julia/v0.6/.benchmarks/WaveSimulator/results
1-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "simulation" => 3-element BenchmarkGroup(["simulate", "simulate!"])

julia> showall(cmp)
1-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "simulation" => 3-element BenchmarkTools.BenchmarkGroup:
          tags: ["simulate", "simulate!"]
          "CUDALibs" => 1-element BenchmarkTools.BenchmarkGroup:
                  tags: ["GPU"]
                  "simulate!" => TrialJudgement(+3.52% => invariant)
          "CPUThreads" => 1-element BenchmarkTools.BenchmarkGroup:
                  tags: ["CPU", "CPUThreads"]
                  "simulate!" => TrialJudgement(-0.17% => invariant)
          "CPU1" => 1-element BenchmarkTools.BenchmarkGroup:
                  tags: ["CPU", "CPU1"]
                  "simulate!" => TrialJudgement(+1.16% => invariant)


I just started using PkgBenchmark.jl and find it very convenient.

I thought I would ask here instead of opening a new topic: is there some (rudimentary) tool for summarizing/visualizing PkgBenchmark results over time/commits? E.g. a nice graph of how a particular benchmark developed over time, or even a table or something like that.
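In the absence of such a tool, a crude stopgap is possible with a few lines of plain Julia. This is a minimal sketch, not an existing package; the commit labels and timings are made up, and in practice they would come from the saved PkgBenchmark result files:

```julia
# Given minimum benchmark times per commit (made-up values here),
# print a crude text chart, oldest commit first.
times = [("c0be5c", 180.8), ("a1b2c3", 175.2), ("d4e5f6", 190.1)]  # (commit, ms)
for (commit, t) in times
    bar = repeat("*", round(Int, t / 10))  # one star per 10 ms
    println(rpad(commit, 8), "| ", bar, " ", t, " ms")
end
```

A proper tool would presumably read the stored .jld result files and plot per-benchmark trends, but that is exactly the missing piece being asked about.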


Does @Tamas_Papp still use PkgBenchmark.jl, or has he switched to something else? Have you found a way to visualize the results?


I am still using PkgBenchmark.jl, but did not progress with the visualization. I don’t have time for this ATM, so if anyone feels like working on it…