I would like to introduce PAPI.jl, Julia bindings for the PAPI hardware performance counters library.
The Performance Application Programming Interface (PAPI) is a portable and efficient API for accessing the performance counters available on modern processors. It attempts to provide a unified set of performance metrics across vendors and platforms. Besides performance counters, it includes components to measure power consumption, software-defined events, task granularity, etc.
PAPI includes a component for the Linux performance counters, so all of those counters are also accessible from PAPI.jl.
PAPI.jl adds high-level functions on top, providing primitives that developers can easily understand and use, while still allowing access to some of the low-level primitives from Julia.
The basic interface is provided by two macros, @profile and @sample, both of which take a set of events and a function call. @profile gives a one-shot profile, whereas @sample collects multiple samples for more detailed insight (useful for statistical testing).
Example usage:
using PAPI
function mysum(X::AbstractArray)
    s = zero(eltype(X))
    for x in X
        s += x
    end
    s
end
X = rand(10_000)
stats = @profile mysum(X)
By default, this outputs a profile of total and branch instructions:
BR_INS = 4875055 # 22.0% of all instructions # 588.0 M/sec
BR_MSP = 188115 # 4.0% of all branches
TOT_INS = 22197305 # 0.839 insn per cycle
TOT_CYC = 26453432 # 3.191 Ghz # 1.192 cycles per insn
runtime = 8289881 nsecs
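The annotations after each `#` appear to be simple ratios of the raw counters. The following is a minimal sketch in plain Julia that recomputes them from the numbers printed above. The helper `derived_metrics` and its field names are hypothetical and are not part of PAPI.jl; the formulas are my reading of the output, not taken from the package source.

```julia
# Raw counter values copied from the profile output above.
br_ins     = 4_875_055    # BR_INS: branch instructions
br_msp     = 188_115      # BR_MSP: mispredicted branches
tot_ins    = 22_197_305   # TOT_INS: total instructions
tot_cyc    = 26_453_432   # TOT_CYC: total cycles
runtime_ns = 8_289_881    # runtime in nanoseconds

# Hypothetical helper (not a PAPI.jl function): derive the ratios that
# the profile output shows next to the raw counts.
function derived_metrics(br_ins, br_msp, tot_ins, tot_cyc, runtime_ns)
    return (
        branch_share    = br_ins / tot_ins,           # fraction of instructions that are branches
        mispredict_rate = br_msp / br_ins,            # fraction of branches mispredicted
        ipc             = tot_ins / tot_cyc,          # instructions per cycle
        cpi             = tot_cyc / tot_ins,          # cycles per instruction
        ghz             = tot_cyc / runtime_ns,       # cycles per nanosecond equals GHz
        mbranches_per_s = br_ins / runtime_ns * 1e3,  # millions of branches per second
    )
end

m = derived_metrics(br_ins, br_msp, tot_ins, tot_cyc, runtime_ns)
println(m)
```

With these inputs the misprediction rate comes out near 3.9%, so the percentages in the output seem to be rounded to whole numbers before printing.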
A set of events can also be provided, for example to gain insight into vectorization:
julia> stats = @profile [PAPI.VEC_DP, PAPI.DP_OPS] mysum(X)
EventValues:
VEC_DP = 1000000
DP_OPS = 1000000 # 1.0x vectorized
runtime = 1424735 nsecs
I use it extensively in my high-performance work and hope others might find it useful as well.
This kind of profiling, however, does not give a complete picture, and care needs to be taken when interpreting performance counters.