Thank you for bringing this up. Getting the right answer is very important. “BenchmarkTools.jl gets the right answer here and Chairmarks does not” is a powerful bug report.
I define the “runtime” of a microbenchmark in terms of “if I swap f for g in a larger program, how will the runtime change?”. We can establish a ground truth for whether f or g is faster by actually embedding them in a larger program.
# In benchmark 1, the second argument (the shift amount once f or g is plugged in)
# is a 64-bit hash that is constant across the inner loop.
function macro_benchmark_1(f, n)
    sum = 0
    for i in 1:n
        shift = hash(i)
        for j in 1:n
            sum += f(j, shift)
        end
    end
    sum
end

# In benchmark 2, the first argument is the hash and the shift amount is the inner loop index.
function macro_benchmark_2(f, n)
    sum = 0
    for i in 1:n
        val = hash(i)
        for j in 1:n
            sum += f(val, j)
        end
    end
    sum
end

# In benchmark 3, both arguments are plain loop indices.
function macro_benchmark_3(f, n, m)
    sum = 0
    for i in 1:n
        for j in 1:m
            sum += f(i, j)
        end
    end
    sum
end

# Benchmark 4 is a SIMD reduction; the shift amounts come from an array (filled with 0:63 below).
function macro_benchmark_4(f, A)
    # https://discourse.julialang.org/t/how-to-shift-bits-faster/19405
    sum = 0
    @inbounds @simd for k = 1:length(A)
        sum += f(k, A[k])
    end
    sum
end
# f uses Julia's generic `<<`, which branches on the shift amount (shifts ≥ the bit width give zero).
f(x, n) = x << n
# g masks the shift amount into range, so the compiler can drop that branch.
g(x, n) = x << (n & 63)
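# Aside: the two definitions only disagree once the shift amount reaches the bit width,
# which is exactly the case Julia's `<<` has to guard against:
@assert f(UInt64(1), 64) == 0          # out-of-range shifts are defined to return zero
@assert g(UInt64(1), 64) == UInt64(1)  # 64 & 63 == 0, so this shifts by nothing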
using BenchmarkTools, Chairmarks
# Microbenchmarks
x = UInt128(1); n = 1;
@btime f($x, $n); # 2.500 ns (0 allocations: 0 bytes)
@btime g($x, $n); # 1.958 ns (0 allocations: 0 bytes)
@b f($x, $n) # 1.136 ns
@b g($x, $n) # 1.135 ns
# Macrobenchmarks
@time macro_benchmark_1(f, 100_000); # 0.000104 seconds (1.04 ns per outer iteration; the 100_000-iteration inner loop is clearly not being executed as written)
@time macro_benchmark_1(g, 100_000); # 1.457513 seconds (0.146 ns per inner iteration)
@time macro_benchmark_2(f, 30_000); # 0.762815 seconds (0.848 ns/iter)
@time macro_benchmark_2(g, 30_000); # 0.759171 seconds (0.844 ns/iter)
@time macro_benchmark_3(f, 10_000_000, 100); # 0.339392 seconds (0.339 ns/iter)
@time macro_benchmark_3(g, 10_000_000, 100); # 0.327428 seconds (0.327 ns/iter)
A = rand(0:63, 100_000_000);
@time macro_benchmark_4(f, A); # 0.046893 seconds (0.469 ns/iteration)
@time macro_benchmark_4(g, A); # 0.020474 seconds (0.205 ns/iteration, slightly less than 1 clock cycle)
Julia Version 1.11.0-alpha1
Commit 671de9f5793 (2024-03-01 08:30 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (aarch64-linux-gnu)
CPU: 8 × unknown # Asahi Linux on Mac M2 (3.5 GHz)
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
These macrobenchmarks indicate to me that f and g are too small and the compiler integrates them into their surrounding code too fully for them to be viable candidates for microbenchmarking. I don’t know what the “right answer” is for microbenchmark results.
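For what it’s worth, here is one quick way to probe how much of the story is inlining (a sketch only; I’m not claiming any timings, and the _ni names are just for illustration): wrap the same two shift variants in @noinline helpers so every inner iteration pays a real call, and rerun one of the macrobenchmarks.

# Non-inlined wrappers around the same two shift variants (illustration only)
@noinline f_ni(x, n) = x << n
@noinline g_ni(x, n) = x << (n & 63)

@time macro_benchmark_3(f_ni, 10_000_000, 100);
@time macro_benchmark_3(g_ni, 10_000_000, 100);

That doesn’t give “the” right answer either, but it separates the cost of the operation itself from what the optimizer does with it in context.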