Using LinuxPerf for small functions

LinuxPerf.jl wraps the perf_event_open Linux syscall. But using it for small functions gives ridiculous results. The following example reports over 12000 clock cycles and 2000 memory fetches to compute 1+1:

using LinuxPerf
@measure 1+1

Dumb question: what is running other than the execution of 1+1 for perf_event_open to report so many events? Expression parsing?

@vchuravy :point_up:

First, run @measure twice, and inside a function: f() = @measure 1+1; f(); f(). Unlike BenchmarkTools.@btime, @measure does not run the expression multiple times and average over the runs, so the first call includes compilation time. Running it at global scope also adds some penalty from executing code at the top level.

Second, this function is far too small to be effectively measured by the perf subsystem. You’re basically doing a syscall, performing 1-2 instructions, and then immediately performing another syscall. Each of those syscalls requires a non-negligible number of instructions (which perf will count), both in Julia and in the kernel. You could try running the expression in a loop for some number of iterations and averaging, but the average would still include loop overhead (at least 2 extra instructions per iteration).


Running @measure multiple times still gives me thousands of cycles (6000+, though quite variable across runs) and 2270 instructions.

Doing an equivalent benchmark in C gives me ~60 cycles and 13 instructions.

I always define

function foreachf(f::F, N, args::Vararg{Any,A}) where {F,A}
    foreach(_ -> f(args...), 1:N)
end

So that it calls f(args...) a total of N times.
However, you’ll have to make sure the compiler doesn’t defeat the benchmark, like it does for +(::Int,::Int).
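
As a rough usage sketch, the helper can be checked with a function that has a visible side effect, so the compiler cannot drop the calls. The names `calls` and `bump!` are made up for this illustration and are not part of any package.

```julia
# Same helper as above: call f(args...) a total of N times.
function foreachf(f::F, N, args::Vararg{Any,A}) where {F,A}
    foreach(_ -> f(args...), 1:N)
end

# Hypothetical counter and side-effecting function, for illustration only.
calls = Ref(0)
bump!(c, k) = (c[] += k; nothing)

# Each call adds 2 to the counter, so 5 calls leave it at 10.
foreachf(bump!, 5, calls, 2)
@show calls[]
```

With a pure function such as + on two Ints there is no such side effect, so the measured region may end up containing far less work than N calls.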


Can you show the code for this C benchmark?

Sure.

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                   group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;

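    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        /* Opening the counter can fail, e.g. for lack of permission. */
        perror("perf_event_open");
        exit(EXIT_FAILURE);
    }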
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    int x = 1+1;
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));

    printf("Used %lld cycles\n", count);

    close(fd);
    return x-2;
}

gcc perftest.c -o perftest; chmod +x perftest; ./perftest

using LinuxPerf

bench = make_bench([LinuxPerf.EventType(:hw, :cycles)])
function f(bench, x)
  enable!(bench)
  x = x+1
  disable!(bench)
  x
end
x = 1
f(bench, x)  # warm-up call, so compilation is not measured
reset!(bench)
f(bench, x)
@show counters(bench)

I get 96 cycles from the above. So you probably just need to change the default bench (it defaults to reasonable_defaults, which is actually a lot of metrics for the kernel to collect and process) and make sure to measure only within a function.
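
To see how much of that figure is the measurement itself, one option is to compare against a window with nothing inside it. This is only a sketch: it reuses the calls from the snippet above (make_bench, enable!, disable!, reset!, counters), and the helpers `empty_window` and `add_window` are names made up here. It needs a Linux machine where perf events are permitted.

```julia
using LinuxPerf

# Count only cycles, as in the snippet above, to keep the kernel's work small.
bench = make_bench([LinuxPerf.EventType(:hw, :cycles)])

# Hypothetical helper: a measurement window with nothing inside it.
function empty_window(bench)
    enable!(bench)
    disable!(bench)
    nothing
end

# Hypothetical helper: the same window around a single addition.
function add_window(bench, x)
    enable!(bench)
    x = x + 1
    disable!(bench)
    x
end

x = 1

# Warm up both functions so compilation is not counted.
empty_window(bench)
add_window(bench, x)

reset!(bench)
empty_window(bench)
@show counters(bench)   # cycles attributed to enable!/disable! alone

reset!(bench)
add_window(bench, x)
@show counters(bench)   # the same overhead plus the addition
```

If the two counts come out close to each other, most of what is being reported is the cost of enabling and disabling the counter, not the addition.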
