Should we enable storage of function outputs in BenchmarkTools?

Currently doing some maintenance on BenchmarkTools.jl before the upcoming 2.0 release (I’m aiming for October), and I stumbled upon a PR I’m unsure about.
The author @AlexanderNenninger wants to record the values returned by a function during each run of a benchmark. This is useful if we want to discard some runs where e.g. a simulation has failed, but it can also have nasty memory side effects. I think making it optional (and disabling it by default) is a good idea, but I would welcome more experienced opinions.

5 Likes

Hello @gdalle! Don’t really think my opinion is experienced, but thought I’d put in my 2 cents.

This is definitely an interesting idea, but from your description I almost feel like this functionality would be more suitable for a test suite checking correctness of results, rather than BenchmarkTools, which is focused on timing and memory usage.

I do think that discarding runs (like failed simulations or early exits) might be fruitful. To tackle the memory side effects of recording all the return values, would it be more feasible to just record the memory size of each return value and use that to spot off-nominal runs? Those off-nominal runs could then be discarded from the benchmark results or potentially used in other ways.
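
For illustration, a rough sketch of what recording only sizes could look like in user code today (expensive_computation and tracked! are made-up names for this example):

using BenchmarkTools

# Sketch: record only the size of each return value instead of the value itself.
# expensive_computation is a toy stand-in for the real function under test.
expensive_computation() = rand() < 0.1 ? zeros(10^6) : zeros(10)

sizes = Int[]  # one entry per call
function tracked!(sizes)
    r = expensive_computation()
    push!(sizes, Base.summarysize(r))  # store the size, not the value
    r
end

b = @benchmark tracked!($sizes) samples=100 evals=1
# Off-nominal runs then show up as outliers in sizes (warmup may add an extra entry).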

My gut says that storing the results is too expensive and might have unwanted side effects.
You are essentially creating extra traffic on the memory subsystem, which may lead to cache evictions.

7 Likes

Wouldn’t it be enough to have a predicate function that checks the result and decides whether the run was successful and should count towards the statistics, or failed and should be dropped? Then we don’t need to store every result.
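
As a rough illustration, such filtering can already be emulated in user code without any BenchmarkTools changes (simulate and is_valid below are toy stand-ins):

using BenchmarkTools

simulate() = (sleep(0.001); randn())  # toy stand-in for the real workload
is_valid(r) = abs(r) < 2              # toy predicate: did the run "succeed"?

ok = Bool[]  # one success flag per call
function checked!(ok)
    r = simulate()
    push!(ok, is_valid(r))
    r
end

b = @benchmark checked!($ok) samples=50 evals=1
# Warmup may call checked! a few extra times, so align the flags with the samples:
flags = ok[end-length(b.times)+1:end]
kept_times = b.times[flags]  # keep only timings of runs the predicate accepted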

I agree with @vchuravy that we don’t want to perturb the measurements too much, so this predicate might be a good middle ground. If one really wishes to store all results then this predicate could even be abused for that purpose…

5 Likes

The original intent behind this PR was benchmarking non-deterministic algorithms. I’ve got a chaotic ODE simulation that’s highly sensitive to initial conditions. The modifications allow me to measure its expected runtime.

In my use case the extra memory pressure of recording a few hundred results is negligible, but I can see how in many other cases it could be an issue.

The predicate idea seems pretty good, I wish I had thought of that sooner.

3 Likes

I think there is also the question: when is BenchmarkTools appropriate to use?

The work it does on noise reduction and timing-resolution enhancement stops being useful around the one-millisecond mark. I do see the use for a tool that gives you nice statistics and measurements for slower pieces of code, but there BenchmarkTools is overkill, and you start getting different sources of variability that it does not account for. (You might want to measure only on-CPU time, and not time spent in the kernel or waiting to be scheduled.)

BT also assumes determinism and doesn’t handle variability in the code super well; this is why it struggles with in-place sort.
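
For reference, the usual workaround for the in-place sort case is a fresh input per sample via setup together with evals=1:

using BenchmarkTools

# Regenerate the input for every sample and use a single evaluation per sample,
# so sort! never sees an already-sorted array.
@benchmark sort!(x) setup=(x = rand(1000)) evals=1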

1 Like

After a bit of research, I believe the best way would be to keep the current implementation in place, but hide it behind a keyword argument that is disabled by default. When the feature is not required, the return values will be stored in a Vector{Nothing}, which has almost no memory or computational overhead (a constant 40 bytes on my machine, independent of length).
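
A quick way to sanity-check that claim (the exact byte count may vary across Julia versions, but it should not grow with the length):

# Nothing is a zero-size singleton, so a Vector{Nothing} only pays for the array header.
Base.summarysize(Vector{Nothing}(undef, 10))    # small constant (~40 bytes reported above)
Base.summarysize(Vector{Nothing}(undef, 10^6))  # same size, independent of length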

@vchuravy thoughts?

I am still hesitant. Messing with the harness code is tricky and you add a conditional branch and rely on const-prop to remove it. I don’t see why this needs to be a BT feature.

6 Likes

We could enforce the branch selection at compile time using Val and/or @generated. Anyhow, I feel like we now actually need some data to estimate the impact further. What benchmark cases and metrics are you particularly worried about?
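
As a sketch of the Val trick (run_sample is a made-up name, not BenchmarkTools internals), the branch moves into dispatch and disappears from the measurement path:

# Dispatch on Val selects the variant at compile time, so the measurement path
# compiles without any runtime check for the disabled feature.
run_sample(f, ::Val{true}, results)  = (push!(results, f()); nothing)
run_sample(f, ::Val{false}, results) = (f(); nothing)

results = Any[]
run_sample(rand, Val(true), results)   # recording variant
run_sample(rand, Val(false), results)  # measurement variant: no recording code at all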

For what it’s worth, as simply a casual user of BenchmarkTools who is naive about the actual implementation, I’ll say that the proposed ability to record results was immediately appealing to me. I’ve had some cases where code, like an integration with HCubature, will run arbitrarily fast or slow with results that are wonky but not obviously wrong at runtime, and having the ability to map input args to results with the corresponding times would’ve greatly helped debugging efforts. That being said, it sounds like the implementation of something like this is complicated and could impact the actual validity of the results, which is an understandably strong counter-argument.

1 Like

We’ve decided against it. The implementation wouldn’t be too difficult; it’s just a bit of work to show there’s negligible performance impact. I’ll keep the fork up since I need it myself, though.

2 Likes

If the overhead is negligible, why not just wrap the function you want to benchmark in another one that keeps track of the results? For example:

julia> using BenchmarkTools

julia> function f() # function to benchmark
           a = rand()
           sleep(a)
           a
       end
f (generic function with 1 method)

julia> function g!(results) # wrapper
           a = f()
           push!(results, a)
       end
g! (generic function with 1 method)

julia> results = Float64[]
Float64[]

julia> benchmark_result = @benchmark g!($results) samples = 10 evals = 1

2 Likes