Should we enable storage of function outputs in BenchmarkTools?

Currently doing some maintenance on BenchmarkTools.jl before the upcoming 2.0 release (I’m aiming for October), and I stumbled upon a PR I’m unsure about.
The author @AlexanderNenninger wants to record the values returned by a function during each run of a benchmark. This is useful if we want to discard some runs where e.g. a simulation has failed, but it can also have nasty memory side effects. I think making it optional (and disabling it by default) is a good idea, but I would welcome more experienced opinions.


Hello @gdalle! Don’t really think my opinion is experienced, but thought I’d put in my 2 cents.

This is definitely an interesting idea. From your description, it almost feels like functionality better suited to a test suite for correctness of results than to BenchmarkTools, which is focused on timing and memory usage.

I do think that discarding runs (like failed simulations or early exits) could be fruitful. To tackle the memory side effects of recording all the return values, would it be more feasible to record only the memory size of each return value and use that to spot off-nominal results? Those off-nominal runs could then be discarded from the benchmark results or potentially used in other ways.
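As a rough sketch of that suggestion (hypothetical helper name, not part of BenchmarkTools), one could record only `Base.summarysize` of each return value and flag runs whose size deviates from the median:

```julia
using Statistics

# Hypothetical sketch: record only the memory size of each return value
# instead of the value itself, then flag off-nominal runs by size.
function sample_sizes(f; n = 50)
    sizes = [Base.summarysize(f()) for _ in 1:n]
    med = median(sizes)
    outliers = findall(s -> s != med, sizes)  # indices of off-nominal runs
    return sizes, outliers
end
```

For a function that always returns a same-shaped value, `outliers` would be empty; a run returning, say, an error sentinel of a different size would show up there.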

My gut says that storing the results is too expensive and might have unwanted side effects.
You are essentially creating extra traffic on the memory subsystem, which may lead to cache evictions.


Wouldn’t it be enough to have a predicate function that checks the result and decides whether the run was successful and should count towards the statistics, or failed and should be dropped? Then we don’t need to store every result.

I agree with @vchuravy that we don’t want to perturb the measurements too much, so this predicate might be a good middle ground. If one really wishes to store all results then this predicate could even be abused for that purpose…
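To make the predicate idea concrete, here is a minimal hand-rolled sketch (not BenchmarkTools code, and the function names are made up): time each run yourself and keep only the samples whose result passes the predicate.

```julia
# Hypothetical sketch: time each run by hand and keep only the samples
# whose result passes a user-supplied predicate.
function timed_samples(f, predicate; n = 100)
    times = Float64[]
    for _ in 1:n
        t0 = time_ns()
        result = f()
        elapsed = (time_ns() - t0) / 1e9
        predicate(result) && push!(times, elapsed)  # drop "failed" runs
    end
    return times
end

# Example: treat negative results as failed runs and discard those samples.
samples = timed_samples(() -> rand() - 0.5, x -> x >= 0)
```

Nothing but the elapsed times is retained, so the memory footprint stays constant regardless of what `f` returns.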


The original intent behind this PR was benchmarking non-deterministic algorithms. I’ve got a chaotic ODE simulation that’s highly sensitive to initial conditions. The modifications allow me to measure expected runtime.

In my use case the extra memory pressure of recording a few hundred results is negligible, but I can see how in many other cases it could be an issue.

The predicate idea seems pretty good, I wish I had thought of that sooner.


I think there is also the question of when BenchmarkTools is appropriate to use.

The work it does on noise reduction and timing-resolution enhancement stops being useful around the one-millisecond mark. I do see the use for a tool that gives you nice statistics and measurements for slower pieces of code, but BenchmarkTools is overkill there, and you start picking up different variability that is not accounted for. (You might want to measure only on-CPU time and not time spent in the kernel, or time spent not scheduled.)

BT also assumes determinism and doesn’t handle variability in the code super well, which is why it struggles with in-place sort.
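The in-place sort issue is easy to reproduce without BT at all: the first evaluation mutates the input, so every later evaluation times a different (already-sorted) workload.

```julia
# Why in-place sort! confuses repeated-sample benchmarking: the first call
# changes the input, so later calls time sorting of already-sorted data.
v = rand(10^5)
t_first  = @elapsed sort!(v)  # sorts random data
t_second = @elapsed sort!(v)  # input is now sorted: a much easier workload
```

This is why BenchmarkTools offers `setup`/`evals` machinery, but it still assumes each evaluation within a sample sees comparable input.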


After a bit of research, I believe the best way would be to keep the current implementation in place but hide it behind a keyword argument that defaults to off. When the feature is not enabled, the return values are stored as a Vector{Nothing}, which has almost no memory or computational overhead (a constant 40 bytes on my machine, independent of the number of samples).
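The negligible overhead of `Vector{Nothing}` is easy to check: `Nothing` is a zero-size singleton type, so the element storage is empty no matter how long the vector is (the exact constant for the container header varies by Julia version and machine).

```julia
v = Vector{Nothing}(undef, 10^6)
sizeof(v)            # 0: the elements occupy no storage at all
Base.summarysize(v)  # small constant for the container, independent of length
```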

@vchuravy thoughts?

I am still hesitant. Messing with the harness code is tricky: you add a conditional branch and rely on constant propagation to remove it. I don’t see why this needs to be a BT feature.


We could enforce the branch selection at compile time using Val and/or @generated. Anyhow, I feel like we now actually need some data to estimate the impact further. What benchmark cases and metrics are you particularly worried about?
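A minimal sketch of the Val idea (hypothetical function names, not the PR's code): the store/don't-store decision is lifted into the type domain, so each method body is free of the runtime branch and dispatch selects the right one.

```julia
# Hypothetical sketch of compile-time branch selection via Val.
# Each run_sample! method is branch-free; dispatch picks the right one.
run_sample!(results, f, ::Val{true})  = push!(results, f())
run_sample!(results, f, ::Val{false}) = (f(); results)

function run_samples(f; store::Bool = false, n = 100)
    results = store ? Any[] : nothing
    v = Val(store)                     # Bool lifted into the type domain
    for _ in 1:n
        run_sample!(results, f, v)
    end
    return results
end
```

Whether this actually removes all overhead in the benchmark loop would still need measuring, which is the data @vchuravy is asking for.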

For what it’s worth, as simply a casual user of BenchmarkTools who is naive about the actual implementation, I’ll say that the proposed ability to record results was immediately appealing to me. I’ve had some cases where code elements, like integrating with HCubature, will run arbitrarily fast or slow with results that are wonky but not obviously wrong at runtime, and having the ability to map input args to results with the corresponding times would’ve greatly helped debugging efforts. That being said, it sounds like the implementation of something like this is complicated and could impact the actual validity of the results, which is an understandably strong counter-argument.


We’ve decided against it. The implementation wouldn’t be too difficult, it’s just a bit of work to show there’s negligible performance impact. I’ll keep up the fork since I need it myself though.


If the overhead is negligible, why not just wrap the function you want to benchmark in another one that keeps track of the results? For example:

julia> using BenchmarkTools

julia> function f() # function to benchmark
           a = rand()
           return a
       end
f (generic function with 1 method)

julia> function g!(results) # wrapper
           a = f()
           push!(results, a)
           return a
       end
g! (generic function with 1 method)

julia> results = Float64[]
Float64[]

julia> benchmark_result = @benchmark g!($results) samples = 10 evals = 1