Why does BenchmarkTools `@belapsed` make so many allocations?

For a long time, I have been hand-coding my own benchmarks using expressions like `minimum(@elapsed f() for _ in 1:samples)`. It was brought to my attention that BenchmarkTools.jl does this kind of repeated sampling automatically, so I have been trying to switch over. However, BenchmarkTools.jl seems to incur much higher memory usage, which limits the size of the benchmark studies I can run on my computer.

Why does the BenchmarkTools @belapsed macro cause so many allocations in the example below? Is there a way to prevent this?

julia> using BenchmarkTools

julia> BenchmarkTools.DEFAULT_PARAMETERS.samples = 1

(Here I have set the samples to 1 so that the BenchmarkTools macro is, in effect, equivalent to the plain @elapsed macro.)

julia> x = 5.0

julia> @time @elapsed sin(x)
  0.000004 seconds (1 allocation: 16 bytes)

julia> @time @belapsed sin(x)
  0.579649 seconds (549.22 k allocations: 10.147 MiB, 91.95% gc time, 5.33% compilation time)

julia> @time @belapsed sin($x)
  0.552421 seconds (44.52 k allocations: 2.397 MiB, 94.37% gc time, 4.06% compilation time)

Second run, so compilation latency is excluded:

julia> @time @elapsed sin(x)
  0.000005 seconds (1 allocation: 16 bytes)

julia> @time @belapsed sin(x)
  0.558569 seconds (545.59 k allocations: 9.939 MiB, 92.03% gc time, 4.72% compilation time)

julia> @time @belapsed sin($x)
  0.551201 seconds (44.52 k allocations: 2.397 MiB, 93.56% gc time, 4.69% compilation time)

Similar but with five samples:

julia> BenchmarkTools.DEFAULT_PARAMETERS.samples = 5

julia> @time minimum(@elapsed sin(x) for _ in 1:5)
  0.032541 seconds (72.55 k allocations: 3.968 MiB, 99.26% compilation time)

julia> @time @belapsed sin(x)
  0.540074 seconds (549.64 k allocations: 10.001 MiB, 92.67% gc time, 4.19% compilation time)

julia> @time @belapsed sin($x)
  0.544774 seconds (44.58 k allocations: 2.399 MiB, 93.15% gc time, 4.74% compilation time)

BenchmarkTools.jl also leaks memory.
LoopVectorization.jl’s benchmarks would leak about 20 GB of memory by the time they were done.
So my workaround was to use Distributed, run the benchmarks in worker processes, and then periodically call rmprocs(workers()) to free the memory and addprocs to replace the workers.
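A minimal sketch of that workaround, assuming your benchmark driver can be passed in as a function (the name `benchmark_in_fresh_worker` is my own, not part of any package):

```julia
using Distributed

# Run `f(batch)` on a freshly spawned worker process, then kill the
# worker so any memory retained by the benchmarking machinery is
# released along with the process.
function benchmark_in_fresh_worker(f, batch)
    pid = only(addprocs(1))
    try
        # Closures and Base functions serialize to the worker; a named
        # function from your own code would need `@everywhere` first.
        return remotecall_fetch(f, pid, batch)
    finally
        rmprocs(pid)  # frees all memory held by the worker process
    end
end
```

Spawning and tearing down a worker per batch adds startup overhead, so this only pays off when the leak is large relative to the batch runtime.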

Of course, you could argue that this makes it less convenient than rolling your own benchmark loop with repeated @elapsed.


That’s my feeling too XD. BenchmarkTools.jl seems really useful if you want a quick A/B comparison of two functions, but to “benchmark” a whole package, where you want to compute specific statistics over the timings, vary the input sizes, and organize everything in a DataFrame or table, hand-coding seems like the way to go.
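For what it’s worth, a hand-rolled study along those lines can stay quite small. A sketch, using only the Statistics stdlib (the function name and the NamedTuple row layout are my own choices; `DataFrame(results)` would turn the rows into a table if you use DataFrames.jl):

```julia
using Statistics

# Time `f(x)` `samples` times and return summary statistics.
function sample_times(f, x; samples = 100)
    f(x)  # warm-up call so compilation time is not measured
    times = [(@elapsed f(x)) for _ in 1:samples]
    (minimum = minimum(times), median = median(times), mean = mean(times))
end

# Example: sweep input sizes, collecting one row of statistics per size.
results = [(n = n, sample_times(sum, rand(n))...) for n in (10, 100, 1000)]
```

Each element of `results` is a NamedTuple like `(n = 10, minimum = ..., median = ..., mean = ...)`, which most table packages accept directly.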