lmiq
August 10, 2021, 2:44pm
1
What is correct here? I guess it is the one with evals=1. In that case, couldn't it be the default option?
julia> using BenchmarkTools
julia> @btime sin(5.0)
1.538 ns (0 allocations: 0 bytes)
-0.9589242746631385
julia> x = 5.0
5.0
julia> @btime sin($x)
7.530 ns (0 allocations: 0 bytes)
-0.9589242746631385
julia> @btime sin($x) evals=1
40.000 ns (0 allocations: 0 bytes)
-0.9589242746631385
jling
August 10, 2021, 2:47pm
2
@btime gives you the minimal time, so of course evals=1 will make a difference.
When things can be constant-propagated, I usually Ref-wrap the input, like:
julia> using BenchmarkTools
julia> @btime sin(5.0) # bogus result from const prop
0.013 ns (0 allocations: 0 bytes)
-0.9589242746631385
julia> x = 5.0
5.0
julia> @btime sin($x) # bogus result from const prop
0.013 ns (0 allocations: 0 bytes)
-0.9589242746631385
julia> @btime sin($(Ref(x))[]) # ok
6.476 ns (0 allocations: 0 bytes)
-0.9589242746631385
3 Likes
lmiq
August 10, 2021, 2:51pm
4
I don't think it is simply that. With evals=1 it runs many samples of 1 evaluation each. Without it, it runs many samples of many evaluations each (at least that is what I understand from the manual). My understanding is that when more than one evaluation per sample is run, we get some artifact associated with caching of results.
julia> @benchmark sin($x)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.539 ns … 42.284 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.425 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.240 ns ±  1.905 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted]
  7.54 ns         Histogram: frequency by time         15.4 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark sin($x) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  41.000 ns … 542.000 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     45.000 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.989 ns ±   5.730 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted]
  41 ns           Histogram: frequency by time           53 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
@kristoffer.carlsson I think you still have constant propagation in your Ref example, but within samples. I might be wrong, though. What do you get if you use evals=1? (Your timings are very different from mine here.)
Here I get:
julia> @btime sin($(Ref(x))[])
7.778 ns (0 allocations: 0 bytes)
-0.9589242746631385
julia> @btime sin($(Ref(x))[]) evals=1
40.000 ns (0 allocations: 0 bytes)
-0.9589242746631385
I use nightly Julia, which probably has better constant propagation.
2 Likes
lmiq
August 10, 2021, 3:00pm
6
That is good news for Julia, but not so for BenchmarkTools, which becomes harder to understand.
I still don't understand the difference in the results. It is even strange that, for example, using evals=1 one gets:
julia> @btime sin($x) evals=1
40.000 ns (0 allocations: 0 bytes)
-0.9589242746631385
and with evals=2 one gets:
julia> @btime sin($x) evals=2
24.500 ns (0 allocations: 0 bytes)
-0.9589242746631385
And by increasing evals one converges to about 7 ns, which is (?) the correct benchmark. These results are completely systematic, so they do not seem to be associated with random fluctuations of the benchmark.
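For what it's worth, the numbers above fit a constant-overhead model, measured ≈ t + c/evals (my arithmetic, not from the thread):

julia> c = 2 * (40.0 - 24.5)   # solve t + c = 40.0 and t + c/2 = 24.5
31.0

julia> t = 40.0 - c            # implied true runtime, in ns
9.0

which is in the right ballpark of the ~7-8 ns seen at large evals.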
I checked the code a few months ago and it seemed that the runtime of a single time_ns() call is always added to the measurement, resulting in a 25 ns/evals error on my machine.
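As a rough sketch (my paraphrase of the mechanism, not BenchmarkTools' actual source), each sample is timed around a loop of evals evaluations, so a fixed timer cost ends up divided by evals:

# Toy model of one sample: the time_ns() calls cost a fixed amount per
# sample, so the reported per-evaluation time is off by overhead/evals.
function sample_time(f, x, evals)
    t0 = time_ns()               # timer read, paid once per sample
    for _ in 1:evals
        f(x)                     # the expression under test, evals times
    end
    t1 = time_ns()               # second timer read
    return (t1 - t0) / evals     # per-eval estimate, error ≈ overhead/evals
end

With evals=1 the whole overhead lands on a single evaluation; with evals=1000 it is spread a thousand ways.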
3 Likes
lmiq
August 10, 2021, 3:04pm
8
Ah, that is one reason for the problem. Ok. So that systematic error is diluted when one uses many evaluations in each sample.
That, mixed with the constant-propagation issue, makes benchmarking a little bit confusing. Perhaps there is room for improvement in the API?
I've found that I get more robust results by just broadcasting over an input vector:
julia> @btime sin.(x) setup=(x=rand(1000));
  6.083 μs (1 allocation: 7.94 KiB)
It's less ergonomic, but it does a good job of guarding against overly aggressive constant propagation.
4 Likes
This is also good for functions with branches. For example, sin will be faster for numbers less than pi/4.
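For instance, one could restrict the setup range to exercise just the small-argument branch (my variation on the snippet above, not from the thread):

julia> @btime sin.(x) setup=(x=rand(1000) .* (pi/4));  # all inputs below pi/4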
3 Likes
Elrod
August 10, 2021, 9:06pm
11
Plus, it makes the SIMD implementations look good.
I normally Ref-wrap any isbits structs I'm benchmarking.
40 ns is way too long. That'd be well over 100 clock cycles for most CPUs.
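Quick sanity arithmetic behind that claim (my numbers, assuming a 3 GHz clock):

julia> 40e-9 * 3e9   # 40 ns at 3 GHz is about 120 cycles
120.0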
lmiq
August 10, 2021, 9:09pm
12
Couldn't that be automatic?
1 Like
Elrod
August 10, 2021, 9:11pm
13
The macro doesn't have access to type information, but it could probably generate code that's the equivalent of
rx = Ref(x)
isbits(x) ? rx[] : x
Or maybe just ref-wrap everything by default. It's of course possible that the compiler will get smart enough to defeat this, too, eventually.
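A hypothetical illustration of that idea (my sketch, not BenchmarkTools code; the helper names are made up):

# Hide isbits values behind a Ref so the compiler cannot treat them as
# compile-time constants; other values pass through unchanged.
maybe_hide(x) = isbits(x) ? Ref(x) : x
maybe_unhide(rx::Ref) = rx[]
maybe_unhide(x) = x

x = 5.0
hx = maybe_hide(x)
# the timed expression would then read: sin(maybe_unhide(hx))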
2 Likes
jzr
August 10, 2021, 9:48pm
14