I have a somewhat complex inner loop (I can share it if needed, but I am looking for general rules here) that benchmarks at 200-800 ns with no allocations. A range makes sense here, since the time depends on which conditional branches are randomly taken in each iteration.
I am very happy with this, but would like to roughly estimate how much performance I may be leaving on the table.
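The actual code shouldn't matter for the question, but as a stand-in, picture something shaped like this (inner_loop and its branches are made up purely for illustration):

using BenchmarkTools

# Illustrative stand-in only: data-dependent branches, no allocations.
function inner_loop(xs)
    acc = 0.0
    @inbounds for x in xs
        # which branch runs varies randomly with the input data
        acc += x > 0.5 ? 2.0 * x : x + 1.0
    end
    return acc
end

xs = rand(100)
@benchmark inner_loop($xs)  # interpolate with $ so global access isn't timed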
To start, I benchmarked some very basic functions:
Simple Functions
function plusrand(x)
    return x + rand()                  # one float add plus an RNG call
end

function plusrand2(x, y)
    return (x + rand(), y + rand())    # two adds, two RNG calls
end

function plusone(x)
    return x + 1                       # a single integer add
end

function plusone2(x, y)
    return (x + 1, y + 1)              # two integer adds
end
julia> @benchmark plusrand(1)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.980 ns … 11.150 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.090 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.137 ns ±  0.193 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted]
 2.98 ns        Histogram: frequency by time        3.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark plusrand2(1, 10)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.878 ns … 20.982 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.938 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.056 ns ±  0.326 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted]
 7.88 ns        Histogram: frequency by time        9.01 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark plusone(1)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.220 ns … 5.530 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.240 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.278 ns ± 0.067 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted]
 1.22 ns        Histogram: frequency by time        1.41 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark plusone2(1, 10)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.240 ns … 8.301 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.330 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.324 ns ± 0.120 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted]
 1.24 ns        Histogram: frequency by time        1.42 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
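One caveat I am aware of: benchmarking with literal arguments can let the compiler constant-fold some of the work, so these micro-timings may be optimistic. The BenchmarkTools manual suggests passing values through a Ref to defeat constant propagation, e.g.:

julia> @benchmark plusone($(Ref(1))[])  # value hidden behind a Ref to block constant folding

I am treating the numbers above as rough lower bounds with that caveat in mind.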
From this I would say my loop could not possibly be faster than roughly 1 ns, is more likely bounded below by 10 ns, and most likely by a few multiples of that. Also, looking at min/max, even the timing range for a single integer addition spans ~4-5 ns, so 5-10 ns appears to set a kind of noise floor for these measurements.
Thus I can't be leaving more than 20-80x in performance on the table (the measured 200-800 ns divided by the ~10 ns floor), and against a more realistic floor of several tens of ns it is probably more like 2-10x.
Is there some rule of thumb, such as counting the number of operations (e.g., additions, multiplications, conditionals), that can be used to tighten this bound?
This is meant to be only a rough estimate, with perhaps order-of-magnitude accuracy (i.e., distinguishing 10 ns from 100 ns).
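The only crude proxy I know of is to dump the generated code and count instructions with Julia's reflection macros (run at the REPL):

julia> @code_llvm debuginfo=:none plusone(1)    # LLVM IR: essentially a single add
julia> @code_native debuginfo=:none plusone(1)  # the corresponding native assembly

But I don't know a principled way to map an instruction count to nanoseconds, hence the question.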