Hi,
I wanted to profile cache misses in some code of mine. I appear to be incapable of second-guessing the hardware prefetcher, and wanted to see whether some explicit prefetches improve my code. These are kinda accessible via something like:
#this is for read data that is needed soon.
#maybe experiment with different locality values?
#prefetch on the instruction cache crashes julia during compilation. meh, not needed anyway.
@inline function prefetch(address)
    #arguments to llvm.prefetch: address, rw (0 = read), locality (0..3), cache type (1 = data)
    Base.llvmcall(("declare void @llvm.prefetch(i8*, i32, i32, i32)",
        "call void @llvm.prefetch(i8* %0, i32 0, i32 0, i32 1)
        ret void"), Void, Tuple{Ptr{Int8}}, convert(Ptr{Int8}, address))
end
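For reference, a call site would look something like this (array and index are just placeholders):

A = rand(Int, 1024)
prefetch(pointer(A, 42))  #hint that A[42] will be read soon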
So, how do you people measure e.g. L1 misses in julia code?
Should I try static-julia and (Linux) perf?
If that is indeed the only reasonable way, should I then write my sample code with a Base.@ccallable ju_main() function, compile it into a shared library, and write a tiny C program that calls the library’s ju_main? Can I use static-julia to build with debug symbols?
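For concreteness, I'm picturing an entry point roughly like this (ju_main and the helpers are placeholders for my actual code):

Base.@ccallable function ju_main()::Cint
    g = load_graph()  #placeholder for my setup code
    traverse(g)       #the hot loop I actually want to measure
    return 0
end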
PS. My code does a graph traversal; links are offsets into a fixed array (see the sketch below). This should be very bad for the cache. But I need proper measurements before I consider fancy memory layouts; a couple of explicit prefetches are a much lower-effort fix than cache-oblivious stuff.
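To illustrate the access pattern, a stripped-down version of the traversal might look like this (layout and names are made up for the example):

#links are stored as offsets into one fixed array, so each step is a
#dependent load.
function traverse(next::Vector{Int}, start::Int, steps::Int)
    v = start
    for _ in 1:steps
        w = next[v]
        prefetch(pointer(next, w))  #hint the upcoming load of next[w];
                                    #only pays off if there is other work
                                    #to overlap with before w is used
        #... real work on v would go here ...
        v = w
    end
    return v
end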