Currently doing some maintenance on BenchmarkTools.jl before the upcoming 2.0 release (I’m aiming for October), and I stumbled upon a PR I’m unsure about.
The author @AlexanderNenninger wants to record the values returned by a function during each run of a benchmark. This is useful if we want to discard some runs where e.g. a simulation has failed, but it can also have nasty memory side effects. I think making it optional (and disabling it by default) is a good idea, but I would welcome more experienced opinions.
Hello @gdalle! Don’t really think my opinion is experienced, but thought I’d put in my 2 cents.
This is definitely an interesting idea, but from your description I almost feel it would be more useful for testing: checking the correctness of results seems better suited to a test suite than to BenchmarkTools, which is focused on timing and memory usage.
I do think that discarding runs (like failed simulations or early exits) might be fruitful. To tackle the memory side effects of recording all the return values, would it be more feasible to just record the memory size of each return value and use that to spot off-nominal runs? Those off-nominal runs could then be discarded from the benchmark results or potentially used in other ways.
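To make the idea concrete, here is a rough sketch (nothing here is an existing BenchmarkTools API; `run_simulation` is just a placeholder):

```julia
using Statistics

# Placeholder for the benchmarked function: occasionally "fails" and
# returns an empty result instead of the usual 1000-element vector.
run_simulation() = rand() < 0.1 ? Float64[] : rand(1000)

# Record only the size of each run's return value, not the value itself.
sizes = [Base.summarysize(run_simulation()) for _ in 1:100]

# Runs whose result size deviates from the typical size could be flagged
# as off-nominal and excluded from the benchmark statistics.
typical = median(sizes)
off_nominal = findall(!=(typical), sizes)
```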
My gut says that storing the results is too expensive and might have unwanted side effects.
You are essentially creating extra traffic on the memory subsystem, which may lead to cache evictions.
Wouldn’t it be enough to have a predicate function that checks the result and decides whether the run was successful and should count towards the statistics, or failed and should be dropped? Then we don’t need to store every result.
I agree with @vchuravy that we don’t want to perturb the measurements too much, so this predicate might be a good middle ground. If one really wishes to store all results then this predicate could even be abused for that purpose…
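Roughly what I have in mind, with made-up names (neither `is_valid` nor this loop is part of BenchmarkTools):

```julia
# Hypothetical predicate: a run counts only if its result contains no NaNs.
is_valid(result) = !any(isnan, result)

# Placeholder for the benchmarked function: sometimes produces NaNs.
run_simulation() = rand() < 0.1 ? fill(NaN, 1000) : rand(1000)

# Rough idea of what the harness would do: time each run, then keep or
# drop the sample depending on the predicate.
times = Float64[]
for _ in 1:100
    t0 = time_ns()
    result = run_simulation()
    t = (time_ns() - t0) / 1e9
    is_valid(result) && push!(times, t)
end

# "Abusing" the predicate to store every result is just a closure over a vector:
stored = Any[]
store_all = result -> (push!(stored, result); true)
```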
The original intent behind this PR was benchmarking non-deterministic algorithms. I’ve got a chaotic ODE simulation that is highly sensitive to initial conditions. The modifications allow me to measure the expected runtime.
In my use case the extra memory pressure of recording a few hundred results is negligible, but I can see how in many other cases it could be an issue.
The predicate idea seems pretty good; I wish I had thought of that sooner.
I think there is also the question of when BenchmarkTools is appropriate to use.
The work it does on noise reduction and timing-resolution enhancement stops being useful around the one-millisecond mark. I do see the use for a tool that gives you nice statistics and measurements for slower pieces of code, but there BenchmarkTools is overkill, and you start getting a different kind of variability that is not accounted for. (You might want to measure only on-CPU time and not time spent in the kernel, or not scheduled.)
BT also assumes determinism and doesn’t handle variability in the code super well; this is why it struggles with in-place sort.
After a bit of research, I believe the best way would be to keep the current implementation in place but hide it behind a keyword argument that is off by default. When the feature is not requested, the return values will be a Vector{Nothing}, which has almost no memory or computational overhead (a constant 40 bytes on my machine, independent of length).
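For what it’s worth, the claim is easy to check in the REPL (exact byte counts may vary across Julia versions and machines):

```julia
# Nothing is a zero-size singleton type, so a Vector{Nothing} carries no
# element data; only the array header remains, regardless of length.
sizeof(Nothing)                                        # 0
sizeof(Vector{Nothing}(undef, 10_000_000))             # 0 bytes of element data
Base.summarysize(Vector{Nothing}(undef, 10))           # ~40 bytes on my machine
Base.summarysize(Vector{Nothing}(undef, 10_000_000))   # same, independent of length
```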
I am still hesitant. Messing with the harness code is tricky: you add a conditional branch and rely on const-prop to remove it. I don’t see why this needs to be a BT feature.
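To spell out the pattern I mean (a made-up sketch, not the PR’s actual code):

```julia
# A keyword flag guards the collection of return values; we would have to rely
# on constant propagation to fold the branches away when the flag is the
# default `false`, otherwise the harness itself gets slower and type-unstable.
function run_samples(f, n; collect_results::Bool=false)
    results = collect_results ? Vector{Any}(undef, n) : Vector{Nothing}(undef, n)
    times = Vector{Float64}(undef, n)
    for i in 1:n
        t0 = time_ns()
        r = f()
        times[i] = (time_ns() - t0) / 1e9
        # This branch should be removed when `collect_results` is known to be false.
        results[i] = collect_results ? r : nothing
    end
    return times, results
end
```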