Using LinuxPerf for small functions

shmiggles · May 2, 2021, 5:03am

LinuxPerf.jl wraps the perf_event_open Linux syscall. But using it for small functions gives ridiculous results. The following example reports over 12000 clock cycles and 2000 memory fetches to compute 1+1:

using LinuxPerf
@measure 1+1

Dumb question: what is running other than the execution of 1+1 for perf_event_open to report so many events? Expression parsing?

carstenbauer · May 2, 2021, 6:52am

@vchuravy

jpsamaroo · May 2, 2021, 12:02pm

First, run @measure twice, and inside a function: f() = @measure 1+1; f(); f(). Unlike BenchmarkTools.@btime, @measure does not run the expression multiple times and average over the runs, so the first call includes compilation time. Running it in global scope also adds some overhead from executing code at the top level.

Second, this function is far too small to be effectively measured by the perf subsystem. You’re basically doing a syscall, performing 1-2 instructions, and then immediately performing another syscall. Performing each of those syscalls requires a non-negligible number of instructions (which perf will count), both in Julia and in the kernel. You could try running this in a repeated loop for some number of iterations, but you’d still be picking up loop overhead (at least 2 extra instructions) after calculating the average result.
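The arithmetic behind that last point can be sketched with a toy cost model. This is not LinuxPerf code: the function names and all of the numbers below are hypothetical, chosen only to show how a fixed enable/disable overhead shrinks when averaged over many iterations while the per-iteration loop overhead does not.

```julia
# Hypothetical cost model (not part of LinuxPerf): every measurement pays a
# fixed `overhead` for the enable/disable calls, and every iteration pays
# `work` for the operation itself plus `loop` for the loop bookkeeping.
measured_total(overhead, work, loop, n) = overhead + n * (work + loop)

# What is left after dividing the measured total by the iteration count.
average_per_iteration(overhead, work, loop, n) =
    measured_total(overhead, work, loop, n) / n

# Illustrative numbers only: 2000 events of fixed overhead, 1 event of real
# work, and 2 events of loop bookkeeping per iteration.
for n in (1, 100, 10_000, 1_000_000)
    avg = average_per_iteration(2000, 1, 2, n)
    println("n = ", n, "  average = ", avg)
end
```

Under these assumed numbers the average tends toward work + loop = 3 as n grows, not toward 1: the fixed overhead is amortized away, but the loop bookkeeping stays in the result.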

shmiggles · May 6, 2021, 4:32am

Running @measure multiple times still gives me thousands of cycles (6000+, though quite variable across runs) and 2270 instructions.

Doing an equivalent benchmark in C gives me ~60 cycles and 13 instructions.

Elrod · May 6, 2021, 5:01am

I always define

function foreachf(f::F, N, args::Vararg{Any,A}) where {F,A}
    foreach(_ -> f(args...), 1:N)
end

So that it calls f(args...) a total of N times.
However, you’ll have to make sure the compiler doesn’t defeat the benchmark, like it does for +(::Int,::Int).
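A minimal usage sketch of the helper above, to confirm the call count. The counter `calls` and the function `bump!` are hypothetical names introduced for illustration, not from the thread.

```julia
# Same helper as above, repeated so the snippet runs on its own.
function foreachf(f::F, N, args::Vararg{Any,A}) where {F,A}
    foreach(_ -> f(args...), 1:N)
end

# Hypothetical workload: increment a counter stored in a Ref.
bump!(r) = (r[] += 1; nothing)

calls = Ref(0)
foreachf(bump!, 1000, calls)
println("bump! ran ", calls[], " times")
```

This only checks that the function runs N times semantically. An optimizer may still collapse such a simple loop into a single update, which is exactly the caveat above, so a workload meant for measurement needs to be something the compiler cannot simplify away.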

jpsamaroo · May 6, 2021, 4:31pm

Can you show the code for this C benchmark?

shmiggles · May 14, 2021, 4:33am

Sure.

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                   group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        perror("perf_event_open");
        exit(EXIT_FAILURE);
    }
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    int x = 1+1;
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));

    printf("Used %lld cycles\n", count);

    close(fd);
    return x-2;
}

gcc perftest.c -o perftest; chmod +x perftest; ./perftest

jpsamaroo · May 15, 2021, 2:39pm

using LinuxPerf

bench = make_bench([LinuxPerf.EventType(:hw, :cycles)])
function f(bench, x)
  enable!(bench)
  x = x+1
  disable!(bench)
  x
end
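x = 1                  # any Int works as input
f(bench, x)            # warm-up call so compilation is not measured
reset!(bench)          # discard the counts from the warm-up
f(bench, x)
@show counters(bench)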

I get 96 cycles from the above. So you probably just need to replace the default bench (reasonable_defaults, which is actually a lot of metrics for the kernel to collect and process) with a smaller set of events, and make sure to measure only inside a function.
