Ah, so I guess the main question is about memory leaks. Could we add an option to create the closure locally?
A thought: perhaps this could be mitigated by including an estimate of the variability in the output, e.g.
154.621 ms ± 5 ms (2 allocs: 762.940 MiB, 13.63% gc time)
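Something along those lines could already be estimated from the raw samples today (a rough sketch, assuming the object returned by @be exposes a samples vector whose elements have a time field in seconds; the exact field names may differ between Chairmarks versions):

using Chairmarks, Statistics
b = @be sum(rand(1000))                # collect many samples instead of a single summary
ts = [s.time for s in b.samples]       # per-evaluation times, in seconds (assumed field name)
mean(ts) * 1e9, std(ts) * 1e9          # mean and spread in nanoseconds, for a "x ms ± y ms"-style display

Whether the standard deviation is the right spread measure is part of the statistical question; a percentile range may behave better for heavy-tailed timings.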
As a user who hasn’t studied the differences between BenchmarkTools.jl and Chairmarks.jl, is it reasonable to try both packages and trust the one reporting a shorter run time for the benchmarked function, assuming that the other package has picked up more overhead?
I think they should be reporting roughly the same timings. Report if they do not.
Benchmarking results are notoriously poorly behaved from a statistical perspective, so it is unlikely that I can provide a fully robust measure of uncertainty. If I can whip up an empirical estimate that performs reasonably, though, that would be nice!
In @XLin0mu’s case, those benchmarks are the result of a single trial; perhaps we should indicate that in the @b display?
I would expect @btime to generally report lower numbers whenever GC is involved, since it runs GC.gc() 3-4 times before each benchmark to try to avoid running GC passes during the benchmark itself. But that doesn’t necessarily mean it’s showing more realistic numbers, if your code is allocating and you expect to need to pay for GC in your actual application.
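If you want @b to start from a similarly quiet heap, you can trigger collections yourself immediately before benchmarking (a sketch; this only changes the starting state, and GC can still run during the samples if the workload allocates):

using Chairmarks
GC.gc(); GC.gc(); GC.gc()    # roughly mimic BenchmarkTools' pre-benchmark collections
@b sum(rand(10^6))           # hypothetical allocating workload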
I found Chairmarks to be more prone to report subnanosecond timings for trivial functions (any function which returns a constant or almost); BenchmarkTools is slightly more conservative, but at that level it’s also hard to assess what the “truth” is.
What functions are you using? Here are some functions I ensure are giving nonzero results in CI
julia> @b rand
2.791 ns
julia> @b rand hash
2.704 ns
julia> @b 1+1
1.200 ns
julia> @b rand _^1
2.099 ns
julia> @b rand _^2
2.175 ns
julia> @b rand _^3
2.100 ns
julia> @b rand _^4
5.110 ns
julia> @b rand _^5
5.100 ns
julia> @b 0
2.101 ns
julia> @b 1
2.098 ns
julia> @b -10923740
2.175 ns
julia> versioninfo()
Julia Version 1.11.0-alpha1
Commit 671de9f5793 (2024-03-01 08:30 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (aarch64-linux-gnu)
CPU: 8 × unknown
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
JULIA_EDITOR = code
julia> import Pkg; Pkg.TOML.parsefile(joinpath(pkgdir(Chairmarks), "Project.toml"))["version"]
"1.1.2"
(The fact that @b 2 is slower than @b 1+1 is due to LilithHafner/Chairmarks.jl#21, “Don't de-optimize @b lit where lit is a literal constant”.)
OK, this is complicated: I can only reproduce it in a Pluto notebook and only on specific hardware so far (znver3):
I was happy to use Chairmarks.jl in this notebook as apparently most of the time was spent in some @btime cells; switching to Chairmarks.jl noticeably cut down the time to rerun the whole notebook from scratch, but I also noticed that constant functions report subnanosecond timings.
What I want most out of a Julia benchmarking package is an attempt at a consistent world-view on the context in which the results are valid. I want a high-level core philosophy on what @bench expr should mean. Or put slightly differently, I want to have a firm understanding of when I’d see the performance estimated by the macro. (Or even more specifically: Where can I plop the given expr to see that runtime? What parts of the expr need to be constant? Which parts are function arguments? Which parts are literals?)
To be concrete: what should the result be for @bench 1+1? I honestly don’t know anymore — not for either package! But this seems like the sort of thing that should be fundamentally answerable at a very high level. And, really, either result is defensible. I’m ok with getting zeros back! And I’m ok with some nonzero floor.
Chairmarks is some fancy UI and instrumentation wrapping execution that looks like this:

function benchmark(f, evals, samples)
    minimum(@elapsed(for _ in 1:evals
        Base.donotdelete(@noinline f())
    end) for _ in 1:samples) / evals
end
Or even more specifically: Where can I plop the given expr to see that runtime?
benchmark(() -> <plop your expr here>, evals, samples)
For example,
julia> function benchmark(f, evals, samples)
           minimum(@elapsed(for _ in 1:evals
               Base.donotdelete(@noinline f())
           end) for _ in 1:samples) / evals
       end
benchmark (generic function with 1 method)
julia> @be 1+1
Benchmark: 6119 samples with 11246 evaluations
min 1.134 ns
median 1.249 ns
mean 1.337 ns
max 16.184 ns
julia> benchmark(() -> 1+1, 11246, 6119)
1.133736439622977e-9
If you provide more arguments to @b, they are handled according to the docstring, making a slightly more complex execution context. See the “Evaluation Model” section of the docstring of @be for a sketch of the evaluation model with more arguments.
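For instance, the extra positional arguments form a setup/benchmark/teardown-style pipeline where _ carries the previous stage’s result (a sketch based on my reading of the docstring; see it for the authoritative description):

@b rand(1000) sort                            # rand(1000) is the setup stage, sort is what gets timed
@b rand(1000) sort(_) issorted(_) || error()  # _ passes each stage's result to the next stage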
Semantically, it would be valid for this to report 0 or nonzero. Base.donotdelete may be re-ordered with respect to other side effects (e.g. the starting and stopping of the timer). It also only guarantees that the value it is passed (2, in the case of @b 1+1) appears “somewhere in memory” evals number of times. According to my read of that, it would be perfectly valid to start the timer, stop the timer, set a register to 2 once, note that the value is present at evals instants during the following clock cycle, and return a runtime of 0.
It would also be valid for the compiler to be less “clever” and report the function call overhead instead (I think this is what is currently happening).
But I want to stay at a higher level than that. I don’t really care how any given package implements its timing loop, I want to know the goal it’s striving for and what its limitations are. I want to know in which real-world contexts 1+1
takes the 1.2 ns it reports — or for what values it may not be representative.
There’s all sorts of wild non-local effects and lots of implicit assumptions:
- Should the passed expressions behave as though they were sitting inside a compiled function? (yes)
- Should the passed expressions behave as though they were sitting inside a locally-visible for loop? (typically no; I think everyone tries to avoid hoisting unless you explicitly write a for loop… but it’s a choice!)
- How hard does it try to specialize? Should the passed expression follow Julia’s specialization rules? Or should it force specialization for all arguments? What’s counted as an argument?
- How hard does it try to prevent constant propagation? Do I need to manually circumvent it with $(Ref(x))[]s? In what cases do I need to worry about this? (The sketch just below shows the pattern in question.)
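For reference, the pattern in question is the usual BenchmarkTools idiom (a sketch; whether the second form is ever still needed is exactly the open question here):

using BenchmarkTools
x = 0.5
@btime sin($x)            # interpolate the value
@btime sin($(Ref(x))[])   # classic idiom for defeating constant propagation of the interpolated value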
So for example there are three classes of performance of sin:
julia> x = 0.5; @b sin(x) # type unstable
16.859 ns (1.00 allocs: 16.099 bytes)
julia> x = 0.5; @b sin($x) # type stable, prevents constant propagation
4.016 ns
julia> @b sin(0.5) # allows constant propagation
1.234 ns
So now which of the following allows or prevents constant propagation? At a higher level: Do benchmarking expressions try to reflect the exact status of the names they use? Or is there another interpretation? What’s the goal? Should any of these permit constant propagation? Which ones?
const y=0.5; @b sin(y)
const y=0.5; @b sin($y)
f() = (z=0.5; @b sin(z)); f()
g() = (z=0.5; @b sin($z)); g()
x=0.5; @b x sin
const y=0.5; @b y sin
@b 0.5 sin
Or even more interesting is with specialization:
julia> @b rand(Float64)
3.032 ns
julia> @b rand($Float64) # What _should_ this mean?
135.413 ns (1.02 allocs: 16.700 bytes)
julia> @b Float64 rand # or this?
133.030 ns (1.02 allocs: 16.675 bytes)
julia> f(::Type{T}) where {T} = @b T rand; f(Float64)
132.402 ns (1.04 allocs: 17.047 bytes)
julia> g(::Type{T}) where {T} = @b rand(T); g(Float64)
3.033 ns
julia> h(::Type{T}) where {T} = (S = T; @b rand(S)); h(Float64)
134.917 ns (1.02 allocs: 16.667 bytes)
These are good questions, @mbauman. Thank you!
In benchmarking and in real world contexts, the compiler knows some things and emits code that does some things. You should get similar results in the real world as in benchmarks if the compiler knows the same things, emits code that does the same things, and the hardware execution context is similar in benchmarks and your real code.
The hardware execution context of Chairmarks is: run your provided expression as a noinline’d function repeatedly, with a small amount (0-10%) of other unspecified code interspersed.
The “thing the code is doing” in the benchmark is a call to a function that does what your expr does. However, this only makes sense if you know what the compiler knows about your expr at compile time. At compile time…
The compiler knows the types of all arguments, but knows none of their values. The thing passed from one part of the pipeline to the next and referred to by _ is an argument, as are things passed in with $. Nothing else is an argument.
For any symbol other than those two types of explicit argument, the compiler knows exactly as much about that symbol as it would have if you used the symbol in a function declared where you invoked @b. Note that captured locals currently get stored in closures as values and their types are a part of the type of the closure. Captured locals are treated similarly to explicit arguments in this way.
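A quick way to see what a closure captures (a sketch; the compiler-generated closure names will differ on your machine):

julia> f() = (z = 0.5; () -> sin(z)); typeof(f())   # the captured local's type, not its value, ends up in the closure's type
var"#1#2"{Float64}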
The compiler does not know that the function is being executed repeatedly on the same arguments in a loop (though this is totally a lie and as the compiler gets smarter it may be hard to keep it up)
What the compiler chooses to do with this information is unspecified, but hopefully it will make good choices and give the fastest results it can with the information it has.
I don’t think I need to make any special comments about how specialization is handled.
The reason I chose not to tell the compiler about the values of interpolated expressions is that doing so requires permanently allocating memory for each new value that is interpolated. If you interpolate with @eval @b f($x), then the compiler will know the value of x.
For example,
const y=0.5; @b sin(y) # Constant prop; Implicit argument which the compiler knows is constant
const y=0.5; @b sin($y) # No constant prop; Explicit argument
f() = (z=0.5; @b sin(z)); f() # No constant prop; Implicit argument which the compiler does not know is constant because it discards that information when forming a closure
g() = (z=0.5; @b sin($z)); g() # No constant prop; Explicit argument
x=0.5; @b x sin # No constant prop; Explicit argument
const y=0.5; @b y sin # No constant prop; Explicit argument
@b 0.5 sin # No constant prop; Explicit argument
# Explicit arguments are always type stable and never constant prop
@b rand((1, 0.0, 1+5im, big(5), big(1.0))) sin # Type stable but no constprop
# You can get type instability if you really want to using implicit arguments
# but I bet the runtime here will depend on all sorts of strange factors
x = 0.5; @b sin(x) # Non-constant global -> dynamic dispatch
h() = (z = rand((1, 0.0, 1+5im, big(5), big(1.0))); @b sin(z)); h() # Compiler performs dynamic dispatch on function construction, making the type of `z` known by the compiler within the hypothetical closure -> no dynamic dispatch
# Chairmarks will never tell the compiler about a value it would not otherwise be able to find out about
i(x) = @b sin(x); i(rand()) # No constant prop
i(x) = @b sin(x); j() = i(0.5); j() # Will not constant prop because the compiler forgets the value of `x` when forming a closure.
k(x) = @eval @b sin($x); k(0.5) # Constant props. Equivalent to @b sin(0.5) at global scope.
Specialization examples (Though I’m able to explain the expected behavior of these without mentioning specialization)
@b rand(Float64)
The compiler knows both the type and value of rand and Float64, so this should report the runtime of the rand method for ::Type{Float64}.
@b rand($Float64)
@b Float64 rand
f(::Type{T}) where {T} = @b T rand; f(Float64)
In all of these cases, Float64 is passed as an explicit “argument”, so the compiler knows its type (DataType) but not its value (Float64), making it impossible for the compiler to know at compile time that the rand(::Type{Float64}) method will be called. Hence the reported runtime should include the time taken to dispatch to that method (which, unless the authors of rand optimized for this case, will likely include dynamic dispatch).
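One way to convince yourself of this outside of benchmarking (a sketch with a hypothetical stand-in function; the exact printout depends on the Julia version):

julia> rand_dyn(T::DataType) = rand(T);   # hypothetical stand-in: the type (DataType) is known, the value (Float64) is not

julia> @code_warntype rand_dyn(Float64)   # inference cannot pin down which rand method runs, so the call is a dynamic dispatch and the body infers as Any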
g(::Type{T}) where {T} = @b rand(T); g(Float64)
This is an implicit argument, so the compiler knows exactly as much about T as it would in a function declared in the context where @b was invoked. It happens that type variables are handled differently than most captured locals, so the closure returned by g(::Type{T}) where {T} = () -> rand(T) actually captures the value Float64 rather than the type DataType in its type. Thus, the compiler knows that T is equal to Float64 at compile time and we should expect to measure the runtime of @b rand(Float64).
h(::Type{T}) where {T} = (S = T; @b rand(S)); h(Float64)
Again, with an implicit argument we have to look at what the compiler would know about S were it used in a function declaration at the location of @b: h(::Type{T}) where {T} = (S = T; () -> rand(S)). S is just an ordinary captured local, so its type (DataType) is known and its value (Float64) is not, resulting in dynamic dispatch. This can be verified with
julia> g(::Type{T}) where {T} = () -> rand(T); typeof(g(Float64))
var"#143#144"{Float64}
julia> h(::Type{T}) where {T} = (S = T; () -> rand(S)); typeof(h(Float64))
var"#145#146"{DataType}
Note that while Chairmarks can never guarantee that constant prop, compile-time dispatch, or any other optimization will happen, they should hopefully happen in benchmarks in the same ways and contexts as they happen outside of benchmarks.
This happens to be a semantic that Chairmarks already follows. But I’m open to changing it before documenting it as a part of the public API.
There is fundamentally one (and a half) key question in my wall of text above:
- What (non-)local information (e.g., const-ness, type, specialization, etc.) about the names in the benchmarked expression is stripped? Or retained?
- How does $ change that? What’s the goal of a $ flag?
Yes! This is exactly the answer I’m looking for! And then it sounds like the goal of a $ is to treat the value as an argument to a function (without any explicit specialization), trying to hide the fact that the value is a constant from the compiler. Does this mean that we shouldn’t need to manually use the $(Ref(x))[] pattern over a simple $x — and if it’s necessary, is it a bug?
These cases seem like odd ones out:
f() = (z=0.5; @b sin(z)); f() # No constant prop; Implicit argument which the compiler does not know is constant because it discards that information when forming a closure
h(::Type{T}) where {T} = (S = T; @b rand(S)); h(Float64)
In practice, sin(z) and rand(S) will constant-propagate or be type-stable in those respective contexts. I really don’t want to need to think about Julia’s closure-capture rules if I didn’t explicitly write the closure (and especially not if they’re in conflict with the overall goal above).
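That intuition is easy to check outside the benchmark macro (a sketch; whether the call folds all the way to a literal depends on the Julia version's effect analysis):

julia> plain() = (z = 0.5; sin(z));

julia> @code_typed plain()   # in an ordinary compiled function, z is a known constant and sin(z) may fold to a literal return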
I really don’t understand what’s happening here:
julia> @b rand((1, 0.0, 1+5im, big(5), big(1.0))) sin
16.304 ns
julia> @b rand((1, 0.0, 1+5im, big(5), big(1.0))) sin
0 ns
It’s also worth noting that BenchmarkTools gives pretty different answers here and I don’t fully appreciate why, either:
julia> @b rand($Float64)
135.854 ns (1.02 allocs: 16.583 bytes)
julia> @btime rand($Float64);
79.287 ns (1 allocation: 16 bytes)
@Lilith that list of examples is great, thanks! Might be good to put in a visible place in the documentation…
It’s nice to be able to toggle constant propagation so easily with interpolation!
const y=0.5; @b sin(y) # Constant prop; Implicit argument which the compiler knows is constant
const y=0.5; @b sin($y) # No constant prop; Explicit argument
Is there a typo here? It’s the same line twice, right? And I do get the same result up to precision:
julia> @b rand((1, 0.0, 1+5im, big(5), big(1.0))) sin
3.700 ns
julia> @b rand((1, 0.0, 1+5im, big(5), big(1.0))) sin
2.976 ns
Nope, that’s why I really don’t understand it
rand((1, 0.0, 1+5im, big(5), big(1.0))) is the setup function whose value is passed to sin, so depending upon your seed you’ll be testing different types. It might be reproducible with a seed?
julia> using Random, Chairmarks
julia> Random.seed!(0)
TaskLocalRNG()
julia> @b rand((1, 0.0, 1+5im, big(5), big(1.0))) sin
2.796 ns
julia> Random.seed!(1)
TaskLocalRNG()
julia> @b rand((1, 0.0, 1+5im, big(5), big(1.0))) sin
0 ns
I can’t reproduce the 0 ns, but I do see something strange: the timing in the second case is often an integer number of ns:
julia> using Random, Chairmarks
julia> Random.seed!(0);
julia> @b rand((1, 0.0, 1+5im, big(5), big(1.0))) sin
3.235 μs (14.00 allocs: 432.115 bytes)
julia> Random.seed!(0);
julia> @b rand((1, 0.0, 1+5im, big(5), big(1.0))) sin
3.296 ns
julia> Random.seed!(1);
julia> @b rand((1, 0.0, 1+5im, big(5), big(1.0))) sin
9.000 ns
julia> Random.seed!(1);
julia> @b rand((1, 0.0, 1+5im, big(5), big(1.0))) sin
5.000 ns
(Linux, Julia 1.10.2, Chairmarks 1.1.2)
I do get zeros (same Julia/Chairmarks versions, on Windows):
julia> using Random, Chairmarks
julia> Random.seed!(0);
julia> @b rand((1, 0.0, 1+5im, big(5), big(1.0))) sin
900.000 ns
julia> Random.seed!(0);
julia> @b rand((1, 0.0, 1+5im, big(5), big(1.0))) sin
2.411 ns
julia> Random.seed!(1);
julia> @b rand((1, 0.0, 1+5im, big(5), big(1.0))) sin
0 ns
julia> Random.seed!(1);
julia> @b rand((1, 0.0, 1+5im, big(5), big(1.0))) sin
0 ns