LinuxPerf.jl wraps the perf_event_open Linux syscall. But using it for small functions gives ridiculous results. The following example reports over 12000 clock cycles and 2000 memory fetches to compute 1+1:
using LinuxPerf
@measure 1+1
Dumb question: what is running other than the execution of 1+1 for perf_event_open to report so many events? Expression parsing?
First, run @measure twice, and inside a function: f() = @measure 1+1; f(); f(). Unlike BenchmarkTools.@btime, @measure does not run the expression multiple times and average over the runs, so the first call includes compilation time. Running it at global scope also adds some overhead from executing top-level code.
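The first-call effect can be seen without LinuxPerf at all. Here is a minimal sketch using only Base's @elapsed; square_plus_one is a made-up function for illustration, and the timings will differ from machine to machine:

```julia
# Hypothetical toy function; any small function would do.
square_plus_one(x) = x * x + 1

# The first call has to compile square_plus_one for Int arguments,
# so its wall time includes compilation.
first_call = @elapsed square_plus_one(3)

# The second call reuses the compiled code.
second_call = @elapsed square_plus_one(3)

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

On a typical setup the first number should be much larger than the second, which is why a single @measure of a fresh function mostly reports compilation work.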
Second, this function is far too small to be effectively measured by the perf subsystem. You’re basically doing a syscall, executing 1-2 instructions, and then immediately doing another syscall. Each of those syscalls takes a non-negligible number of instructions (which perf will count), both in Julia and in the kernel. You could run the expression in a loop for some number of iterations and divide by the iteration count, but the average would still include loop overhead (at least 2 extra instructions per iteration).
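To make the averaging argument concrete, here is a rough back-of-the-envelope model. It assumes the 12000 cycles from the question are pure fixed measurement overhead, and it uses a made-up per-iteration cost of 3 cycles (1 for the work, 2 for the loop); amortized_cost is a hypothetical helper, not part of LinuxPerf:

```julia
# Rough model: each measurement pays a fixed overhead once,
# plus a constant cost for every loop iteration.
# overhead = 12_000 is the cycle count from the question, treated as
# pure measurement overhead (an assumption).
# per_iter = 3 is a made-up figure: 1 cycle of work + 2 of loop overhead.
amortized_cost(N; overhead = 12_000, per_iter = 3) = (overhead + N * per_iter) / N

for N in (1, 1_000, 1_000_000)
    println("N = ", N, ": ", amortized_cost(N), " cycles per iteration")
end
```

With these numbers the per-iteration figure drops from 12003 at N = 1 to about 3.012 at N = 1,000,000: the fixed overhead averages away, but the loop overhead stays in the result.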
For the repeated-loop approach, a small helper like this works:
function foreachf(f::F, N, args::Vararg{Any,A}) where {F,A}
    foreach(_ -> f(args...), 1:N)
end
It calls f(args...) a total of N times.
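As a quick check of the helper, the sketch below repeats its definition and passes it a function with a visible side effect. bump! and calls are hypothetical names introduced only for this example:

```julia
# Same helper as above, repeated so the snippet runs on its own.
function foreachf(f::F, N, args::Vararg{Any,A}) where {F,A}
    foreach(_ -> f(args...), 1:N)
end

# Hypothetical workload: add `step` to the value stored in a Ref.
# Mutating the Ref is a side effect, so the calls cannot be skipped.
bump!(r, step) = (r[] += step; nothing)

calls = Ref(0)
foreachf(bump!, 1000, calls, 2)   # calls bump!(calls, 2) 1000 times

println(calls[])
```

The counter ends at 1000 * 2 = 2000, confirming that every extra argument is forwarded to f on each of the N iterations.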
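However, you’ll have to make sure the compiler doesn’t defeat the benchmark, as it does for +(::Int,::Int): when the arguments are known at compile time and the result has no side effects, the compiler can constant-fold or eliminate the call entirely, leaving nothing to measure.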
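One way to see this is to look at the optimized code with Base's code_typed. The sketch below assumes a reasonably recent Julia (1.6 or later); add_literals and add_argument are made-up names, and the printed IR is an implementation detail that can change between versions:

```julia
add_literals() = 1 + 1      # both operands are known at compile time
add_argument(x) = x + 1     # the operand is only known at run time

# code_typed returns a vector of `CodeInfo => return type` pairs,
# one per matching method; `only` extracts the single entry.
folded, _ = only(code_typed(add_literals, Tuple{}))
kept, _ = only(code_typed(add_argument, Tuple{Int}))

println(folded)   # expected to contain just a constant return
println(kept)     # expected to still contain an integer addition
```

If the first function has been reduced to returning a constant, a benchmark around it measures only the measurement itself. Passing the value in as a function argument, as in add_argument, keeps the addition in the compiled code.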
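using LinuxPerf

# Count only CPU cycles instead of the full default event set
bench = make_bench([LinuxPerf.EventType(:hw, :cycles)])

function f(bench, x)
    enable!(bench)     # start counting
    x = x + 1
    disable!(bench)    # stop counting
    x
end

x = 1                  # any Int works here
f(bench, x)            # warm-up call, so compilation is not measured
reset!(bench)          # discard the counts from the warm-up call
f(bench, x)
@show counters(bench)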
I get 96 cycles from the above. So you probably just need to replace the default bench (which uses reasonable_defaults, a fairly long list of events that the kernel has to collect and process) with a smaller set of events, and make sure to measure only inside a function.