Benchmarking / profiling cache use



I want to profile cache misses in some of my code. I can't predict what the hardware prefetcher does just by eyeballing it, so I want to test whether a few explicit prefetches improve things. These are more or less accessible via something like:

# Prefetch for data that will be read soon.
# llvm.prefetch arguments: address, rw (0 = read), locality (0 to 3), cache type (1 = data).
# Maybe experiment with different locality values?
# Prefetching into the instruction cache (cache type 0) crashes Julia during compilation. Meh, not needed anyway.
# Julia 0.6 syntax (`Void` was renamed `Cvoid` in Julia 0.7).
@inline function prefetch(address)
    Base.llvmcall(("declare void @llvm.prefetch(i8*, i32, i32, i32)",
                   "call void @llvm.prefetch(i8* %0, i32 0, i32 0, i32 1)
                    ret void"),
                  Void, Tuple{Ptr{Int8}}, convert(Ptr{Int8}, address))
end

So, how do you people measure e.g. L1 misses in julia code?

Should I try static-julia and (linux-) perf?

If this is indeed the only reasonable way, should I then write my sample code with a @Base.ccallable ju_main() function, compile it into a shared library, and write a tiny piece of C code that calls the library’s ju_main? Can I use static-julia to build with debug symbols?
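
For illustration, a minimal sketch of what the Julia side of that setup might look like. It assumes only the `Base.@ccallable` macro; `work` is a hypothetical placeholder for the code to be profiled, and the static-julia build step and the C caller are not shown.

```julia
# Hypothetical workload standing in for the code to be profiled.
function work(n::Int)
    s = 0
    for i in 1:n
        s += i % 7
    end
    return s
end

# C-callable entry point; the name `ju_main` follows the question above.
# @ccallable requires an explicit return type annotation.
Base.@ccallable function ju_main()::Cint
    work(10^6)
    return 0
end
```

Calling `ju_main()` from the REPL is a quick check that the entry point works before trying to build it into a library.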

PS. My code does a graph traversal; links are offsets into a fixed array. This should be very bad for the cache. But I need proper tests before I consider fancy memory layouts, and a couple of explicit prefetches are a much lower-effort fix than cache-oblivious stuff.
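
To make the access pattern concrete, here is a rough sketch of "links are offsets into a fixed array" as a pointer chase, written for Julia 1.x. It is not the actual traversal code: `random_cycle`, `sequential_cycle`, `chase`, and `compare` are hypothetical names, and the array size is an arbitrary choice meant to exceed typical cache sizes. It only compares wall-clock time for random versus sequential link order; it does not count cache misses.

```julia
using Random

# links[i] is the index of the next node. Builds a single cycle that visits
# all n slots in random order, so consecutive accesses land far apart.
function random_cycle(n::Integer, rng = MersenneTwister(1))
    order = randperm(rng, n)
    links = Vector{Int}(undef, n)
    for k in 1:n-1
        links[order[k]] = order[k+1]
    end
    links[order[n]] = order[1]
    return links
end

# Cache-friendly baseline: 1 -> 2 -> ... -> n -> 1.
sequential_cycle(n::Integer) = [i == n ? 1 : i + 1 for i in 1:n]

# Follow `steps` links starting at node 1. Each load depends on the previous
# one, and the final index is returned so the loop cannot be optimized away.
function chase(links::Vector{Int}, steps::Integer)
    i = 1
    for _ in 1:steps
        @inbounds i = links[i]
    end
    return i
end

# Time both layouts; returns (sequential seconds, random seconds).
function compare(n::Integer)
    seq = sequential_cycle(n)
    rnd = random_cycle(n)
    chase(seq, 1); chase(rnd, 1)          # warm up / compile
    t_seq = @elapsed chase(seq, n)
    t_rnd = @elapsed chase(rnd, n)
    return t_seq, t_rnd
end

t_seq, t_rnd = compare(1 << 22)           # 4M Int64 links = 32 MiB
println("sequential: ", t_seq, " s, random: ", t_rnd, " s")
```

A large gap between the two timings suggests the traversal is memory-latency bound, which is the case where prefetching or a different layout could plausibly help.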


Perhaps could be of use.


Thank you! Google completely failed me there.

This tool is absolutely fantastic; just kindly asking the kernel is obviously much better than a Rube Goldberg construction to pass compiled Julia code into command-line tools.

At some point there should be a sticky post / wiki describing these tricks. Carnaval’s IACA.jl also looks extremely interesting.


I think this is a continued version of Carnaval’s IACA:

