Is it possible to prefetch memory in Julia?

The title says it all! The only references I can find online are 5 years old and don’t work on Julia v1

How does the title say it all? I don’t understand the title, maybe you can clarify?

It’s an optimization technique: you tell the CPU a few cycles ahead of time that you are going to access a region of memory so that it can fetch it into cache before it’s needed.
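In case it helps later readers, here is a minimal sketch of what such a hint can look like in Julia via `Base.llvmcall`. Treat everything here as an assumption to verify, not a tested recipe: it assumes a 64-bit platform, and the intrinsic spelling (`llvm.prefetch.p0i8` with typed-pointer IR) depends on the LLVM version bundled with your Julia (newer LLVM expects `llvm.prefetch.p0` with opaque `ptr`).

```julia
# Sketch (not a tested API): ask the CPU to pull the cache line at `p` into
# all cache levels (locality 3) in anticipation of a read (rw = 0).
# Assumes a 64-bit platform; the intrinsic name and the typed-pointer IR
# syntax below depend on the LLVM version bundled with your Julia.
@inline function prefetch_read(p::Ptr)
    Base.llvmcall(
        ("declare void @llvm.prefetch.p0i8(i8*, i32, i32, i32)",
         """
         %addr = inttoptr i64 %0 to i8*
         call void @llvm.prefetch.p0i8(i8* %addr, i32 0, i32 3, i32 1)
         ret void
         """),
        Cvoid, Tuple{Ptr{Cvoid}}, Ptr{Cvoid}(p))
end

v = rand(10^6)
prefetch_read(pointer(v, 1))  # issue the hint early…
# …do some unrelated work here…
s = v[1] + v[2]               # …then this access should hit cache
```

A prefetch is only a hint: it never changes program semantics, so a wrong guess wastes a little bandwidth but can't break correctness.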

Wouldn’t it be such a low-level task that it should be handled by the compiler?

You could probably inject a prefetch using LLVM.jl, LLVM-cbe.jl, or `Base.llvmcall` directly.

Are you trying to inject an lfetch instruction, or some of the SIMD intrinsics? If the latter, you could look at SIMDIntrinsics.jl and see if they are supported.

Just found your question some months later. Did you find an answer / were you able to prefetch in Julia?

No I didn’t. It looks like you’d probably need to write the right llvmcall but that scares me and I don’t know what instructions are needed. I’m still interested in the answer though.

No I didn’t. It looks like you’d probably need to write the right llvmcall but that scares me and I don’t know what instructions are needed. I’m still interested in the answer though.

Thx for the quick response!

I can see how the llvm stuff is scary; I feel the same way. I came across that package in my search too, and am postponing it for now. But this should really be useful to all the people writing packages for big-data workloads, where prefetching can give large performance boosts whenever cache misses are an issue (which is basically the case for anyone dealing with data larger than 32…64 KByte on regular hardware).

Wouldn’t it be such a low-level task that it should be handled by the compiler?

Prefetches may even be handled automatically in hardware (by the CPU), if cache prediction has a chance to accurately predict which data will be accessed a few dozen cycles in advance. Prefetch instructions can also be generated automatically by the compiler, but I doubt that compiler-generated prefetches appear outside of simple loops within a pretty limited scope.

So whether prefetches happen automatically really depends on what kind of operations we are doing. I for one have a use case where I am certain that no compiler, let alone hardware, can predict whether (and which!) data I’m going to access next. It’s a specific data structure which most of the time (99%+) certainly won’t be in L1, and likely not even in L2 or L3 cache. At the same time, the memory access I have in mind (to a hash table) is “just” a read-compare(-store) operation, for potentially saving some piece of data for retrieval at a later time (possibly even by another thread).

So it is not time-critical as such: delaying the read-compare(-store) for a few dozen cycles (and doing something else meanwhile) is fine, as long as other work gets done and nothing stalls. That is the ideal use case for an early prefetch: prefetch the data, do something else, then come back to the read-compare(-store) and finish it quickly, since everything has meanwhile been set up in L1 cache.

Edit: Think of it like this: it’s 11am, you’re already hungry, but you have a lot of stuff to do and know you will only have 20 minutes for a lunch break, starting around 12:30. The prefetch is your phone, which you can use to order your favorite pizza, ready to be picked up at 12:35 at the restaurant on the corner (= L1 cache)…

Having to fall back beyond even L3 cache is comparable to your phone being discharged, finding out (only during your lunch break) that the corner place is closed anyway, and then having to take the subway to some other restaurant on the other side of town for the food you want. :sweat_smile:

VectorizationBase.jl provides a few functions to do memory prefetching, but AFAICT they’re not part of the public API.

VectorizationBase.jl provides a few functions to do memory prefetching, but AFAICT they’re not part of the public API.

Thanks! Your post motivated me to look into the code again. I had seen it only a few days ago, but postponed exploring it.

I’m not entirely sure how exactly it is supposed to work, neither on a formal nor on a semantic level, but it also doesn’t look as scary as I remembered. :sweat_smile:

Semantically, the only thing that seems totally clear is: ptr has to be some kind of pointer to the data to be prefetched. I’m assuming that ptr should be initialized like so in Julia…
(code: https://github.com/JuliaLang/julia/blob/36034abf26062acad4af9dcec7c4fc53b260dbb4/base/pointer.jl#L132-L143)

# x = ...some instance of a julia-type, instantiated at some point, prior...
ptr = pointer_from_objref(x)  # ::Ptr{Cvoid}

So far, so good (it seems).
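One caveat worth adding at this point (my note, not something from the thread): for an `Array`, `pointer_from_objref` gives the address of the array *object* (its header), not of the element buffer. To prefetch the elements themselves, `pointer(A)` / `pointer(A, i)` is the address you would want:

```julia
x = rand(100)
ptr_obj  = pointer_from_objref(x)  # Ptr{Cvoid}: address of the Array object itself
ptr_data = pointer(x, 1)           # Ptr{Float64}: address of the first element
UInt(ptr_obj) == UInt(ptr_data)    # false — header and data are distinct addresses
```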

Then there are 2 more parameters to be passed…

::Val{L} = # 0, 1, 2 or 3 : "locality"
::Val{R} = # 0 or 1       : "read / write"

…here, the formal part seems easy: just 2 primitive values, wrapped in a `Val`.
But the semantics are unclear; I’m tempted to guesstimate something like:

# locality...
0 = prefetch to L1-cache (and all other levels)
1 =             L2-
2 =             L3-
3 = least clear, but I know there is something like NTA, which is yet another facility having to do with caching data
# read / write
0 = read.
1 = write. I seem to remember there is something like an "intent to write" flag or "request for ownership",
both related to memory shared by multiple cores / threads, if I remember correctly, though I'm not entirely
sure why that would even be needed, as synchronization isn't guaranteed regardless of cache hits and
misses, unless it is enforced by locks, etc.

…a lot of uncertainty remains, even if I guessed some of it correctly.
The question also remains how to even test this — I think I’ll (at least) sleep on it before I give it a try with some tests.

Here is LLVM’s documentation for prefetch.

`address` is the address to be prefetched, `rw` is the specifier determining if the fetch should be for a read (0) or write (1), and `locality` is a temporal locality specifier ranging from (0) - no locality, to (3) - extremely local keep in cache. The `cache type` specifies whether the prefetch is performed on the data (1) or instruction (0) cache. The `rw`, `locality` and `cache type` arguments must be constant integers.
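Putting that documentation together with the `Val{L}` / `Val{R}` signature quoted earlier, a sketch of a parameterized wrapper could look as follows. Note that `locality` runs the other way around from the guess above: 3 means maximal temporal locality (keep in all caches), 0 means none. Same caveats as elsewhere in this thread: this assumes a 64-bit platform and an LLVM version that accepts this intrinsic spelling, and it is an illustration, not VectorizationBase's actual implementation.

```julia
# L = locality (0 = no temporal locality … 3 = keep in all cache levels),
# R = rw (0 = prefetch for a read, 1 = prefetch for a write), per the LLVM
# docs quoted above. The trailing `i32 1` selects the data cache.
@generated function prefetch(p::Ptr, ::Val{L}, ::Val{R}) where {L,R}
    @assert 0 <= L <= 3 && 0 <= R <= 1  # the intrinsic requires constant integers
    decls = "declare void @llvm.prefetch.p0i8(i8*, i32, i32, i32)"
    body = """
        %addr = inttoptr i64 %0 to i8*
        call void @llvm.prefetch.p0i8(i8* %addr, i32 $R, i32 $L, i32 1)
        ret void
        """
    quote
        Base.llvmcall(($decls, $body), Cvoid, Tuple{Ptr{Cvoid}}, Ptr{Cvoid}(p))
    end
end

a = zeros(Int, 1024)
prefetch(pointer(a), Val(3), Val(1))  # expect to write `a` soon; keep it close
```

Because `L` and `R` are type parameters, each combination compiles to a single prefetch instruction with constant arguments, which is exactly what the intrinsic demands.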

Feel free to make a PR adding documentation to VectorizationBase. =)

Note that LoopVectorization.jl actually inserts prefetches, so while not documented, lots of people are probably using it. (BLAS of course also uses it)
AccurateArithmetic.jl also uses it.

I suggest experimenting with locality as well as prefetch distance (i.e., how far ahead of actually using the memory you prefetch it).

In case an example is useful, here is the part of AccurateArithmetic.jl where memory prefetching is performed. In that context, `$(Prefetch*W)` is the prefetch distance: in a given iteration, the algorithm works with the data located at `px+offset`, but prefetches the data stored at `px+offset+$(Prefetch*W)`.
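For readers without the packages at hand, here is a self-contained sketch of the same idea: a plain summation loop that prefetches `dist` elements ahead of the element currently being consumed. The helper and the distance of 64 elements are illustrative assumptions to tune, not values taken from AccurateArithmetic.jl, and the usual caveats about the llvmcall spelling and 64-bit platforms apply.

```julia
# Hypothetical read-prefetch helper (an llvmcall sketch, same caveats as above).
@inline prefetch_read(p::Ptr) = Base.llvmcall(
    ("declare void @llvm.prefetch.p0i8(i8*, i32, i32, i32)",
     """
     %addr = inttoptr i64 %0 to i8*
     call void @llvm.prefetch.p0i8(i8* %addr, i32 0, i32 3, i32 1)
     ret void
     """),
    Cvoid, Tuple{Ptr{Cvoid}}, Ptr{Cvoid}(p))

# Sum `x`, prefetching the element `dist` iterations ahead of the current one.
function sum_prefetched(x::Vector{Float64}; dist::Int = 64)
    s = 0.0
    @inbounds for i in eachindex(x)
        i + dist <= length(x) && prefetch_read(pointer(x, i + dist))
        s += x[i]
    end
    return s
end
```

Whether this pays off depends on the access pattern: for a purely sequential sum the hardware prefetcher usually already does the job, so the technique matters most for strided or pointer-chasing access.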

Thanks everyone, this is great.