Allocations and time of running the same program twice are orders of magnitude larger than running them separately

Yes, this seems to be it. So @benchmark on the mutating solver itself does not seem to be a good way to evaluate the performance of a fixed-point algorithm like mine: each evaluation starts from whatever δ the previous evaluation left behind. The fix is a wrapper function of the following form (passing the other parameters), which copies the input so every evaluation starts from the same initial δ:

function evaluate(δ_in, Xβ, params)
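    # fresh copy: the solver mutates its first argument in place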
    δ_out = copy(δ_in)
    invert_shares_δ!(δ_out, Xβ, params)
    return δ_out
end
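
I then benchmark the wrapper along these lines (δ0 is just a stand-in for my actual starting vector, and the arguments are interpolated with $ so they are not picked up as non-constant globals):

@benchmark evaluate($δ0, $Xβ, $params)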

This gives the “correct” benchmark performance:

BenchmarkTools.Trial: 151 samples with 1 evaluation.
 Range (min … max):  18.512 ms … 378.933 ms  ┊ GC (min … max):  0.00% … 90.32%
 Time  (median):     20.174 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   34.223 ms ±  68.247 ms  ┊ GC (mean ± σ):  39.90% ± 18.29%

  █▁                                                            
  ██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▄▅ ▄
  18.5 ms       Histogram: log(frequency) by time       372 ms <

 Memory estimate: 307.06 MiB, allocs estimate: 18968.

This is roughly half the time of running the code twice in a row (see my benchmark estimates above), so it makes sense.
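
As an aside, a variant that keeps the copy out of the timed code is to let BenchmarkTools do the copying in a setup expression, with one evaluation per sample, something like this (δ0, Xβ and params again stand in for the actual inputs):

using BenchmarkTools

# setup runs before each sample; evals=1 forces a single evaluation per
# sample, so the in-place solver always starts from a fresh copy of δ0.
@benchmark invert_shares_δ!(δ, $Xβ, $params) setup=(δ = copy($δ0)) evals=1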

Sad news—my code is slower than I thought! I guess I will have to make it faster. :sweat_smile:
