Is it possible to prefetch memory in Julia?

The title says it all! The only references I can find online are 5 years old and don’t work on Julia v1

How does the title say it all? I don’t understand the title, maybe you can clarify?

It’s an optimization technique: you tell the CPU a few cycles ahead of time that you are going to access a region of memory so that it can fetch it into cache before it’s needed.
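In case it helps later readers, here is a minimal sketch of what such a hint can look like in Julia via `Base.llvmcall`. Treat everything here as an assumption to verify, not a tested recipe: it assumes a 64-bit platform, and the intrinsic spelling (`llvm.prefetch.p0i8` with typed-pointer IR) depends on the LLVM version bundled with your Julia (newer LLVM expects `llvm.prefetch.p0` with opaque `ptr`).

```julia
# Sketch (not a tested API): ask the CPU to pull the cache line at `p` into
# all cache levels (locality 3) in anticipation of a read (rw = 0).
# Assumes a 64-bit platform; the intrinsic name and the typed-pointer IR
# syntax below depend on the LLVM version bundled with your Julia.
@inline function prefetch_read(p::Ptr)
    Base.llvmcall(
        ("declare void @llvm.prefetch.p0i8(i8*, i32, i32, i32)",
         """
         %addr = inttoptr i64 %0 to i8*
         call void @llvm.prefetch.p0i8(i8* %addr, i32 0, i32 3, i32 1)
         ret void
         """),
        Cvoid, Tuple{Ptr{Cvoid}}, Ptr{Cvoid}(p))
end

v = rand(10^6)
prefetch_read(pointer(v, 1))  # issue the hint early…
# …do some unrelated work here…
s = v[1] + v[2]               # …then this access should hit cache
```

A prefetch is only a hint: it never changes program semantics, so a wrong guess wastes a little bandwidth but can't break correctness.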

Wouldn’t it be such a low-level task that it should be handled by the compiler?

You could probably inject a prefetch using LLVM.jl, LLVM-cbe.jl, or `Base.llvmcall` directly.

Are you trying to inject an lfetch instruction, or some of the SIMD intrinsics? If the latter, you could look at SIMDIntrinsics.jl and see if they are supported.

Just found your question some months later. Did you find an answer / were you able to prefetch in Julia?

No I didn’t. It looks like you’d probably need to write the right llvmcall but that scares me and I don’t know what instructions are needed. I’m still interested in the answer though.

No I didn’t. It looks like you’d probably need to write the right llvmcall but that scares me and I don’t know what instructions are needed. I’m still interested in the answer though.

Thx for the quick response!

I can see how the llvm stuff is scary; I feel the same way. I came across that package in my search too, and am postponing it for now. But this should really be useful to all the people writing packages for big-data workloads, where prefetching can give large performance boosts whenever cache misses are an issue (which is basically the case for anyone dealing with data larger than 32…64 KByte on regular hardware).

Wouldn’t it be such a low-level task that it should be handled by the compiler?

Prefetches may even be handled automatically in hardware (by the CPU), if cache prediction has a chance to accurately predict which data will be accessed a few dozen cycles in advance. Prefetch instructions can also be generated automatically by the compiler, but I doubt that compiler-generated prefetches appear outside of simple loops within a pretty limited scope.

So whether prefetches happen automatically really depends on what kind of operations we are doing. I for one have a use case where I am certain that no compiler, let alone hardware, can predict whether (and which!) data I’m going to access next. It’s a specific data structure which most of the time (99%+) certainly won’t be in L1, and likely not even in L2 or L3 cache. At the same time, the memory access I have in mind (to a hash table) is “just” a read-compare(-store) operation, for potentially saving some piece of data for retrieval at a later time (possibly even by another thread).

So it is not time-critical as such: delaying the read-compare(-store) for a few dozen cycles (and doing something else meanwhile) is fine, as long as other work gets done and nothing stalls. That is the ideal use case for an early prefetch: prefetch the data, do something else, then come back to the read-compare(-store) and finish it quickly, since everything has meanwhile been set up in L1 cache.

Edit: Think of it like this: it’s 11am, you’re already hungry, but you have a lot of stuff to do and know you will only have 20 minutes for a lunch break, starting around 12:30. The prefetch is your phone, which you can use to order your favorite pizza, ready to be picked up at 12:35 at the restaurant on the corner (= L1 cache)…

Having to fall back beyond even L3 cache is comparable to your phone being discharged, finding out (only during your lunch break) that the corner place is closed anyway, and then having to take the subway to some other restaurant on the other side of town for the food you want. :sweat_smile:

VectorizationBase.jl provides a few functions to do memory prefetching, but AFAICT they’re not part of the public API.

VectorizationBase.jl provides a few functions to do memory prefetching, but AFAICT they’re not part of the public API.

Thanks! Your post motivated me to look into the code again. I had seen it only a few days ago, but postponed exploring it.

I’m not entirely sure how exactly it is supposed to work, neither on a formal nor on a semantic level, but it also doesn’t look as scary as I remembered. :sweat_smile:

Semantically, the only thing that seems totally clear is: ptr has to be some kind of pointer to the data to be prefetched. I’m assuming that ptr should be initialized like so in Julia…
(code: https://github.com/JuliaLang/julia/blob/36034abf26062acad4af9dcec7c4fc53b260dbb4/base/pointer.jl#L132-L143)

# x = ...some instance of a julia-type, instantiated at some point, prior...
ptr = pointer_from_objref(x)  # ::Ptr{Cvoid}

So far, so good (it seems).
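One caveat worth adding at this point (my note, not something from the thread): for an `Array`, `pointer_from_objref` gives the address of the array *object* (its header), not of the element buffer. To prefetch the elements themselves, `pointer(A)` / `pointer(A, i)` is the address you would want:

```julia
x = rand(100)
ptr_obj  = pointer_from_objref(x)  # Ptr{Cvoid}: address of the Array object itself
ptr_data = pointer(x, 1)           # Ptr{Float64}: address of the first element
UInt(ptr_obj) == UInt(ptr_data)    # false — header and data are distinct addresses
```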

Then there are 2 more parameters to be passed…

::Val{L} = # 0, 1, 2 or 3 : "locality"
::Val{R} = # 0 or 1       : "read / write"

…here, the formal part seems easy: just 2 primitive values, wrapped in a `Val`.
But the semantics are unclear; I’m tempted to guesstimate something like:

# locality...
0 = prefetch to L1-cache (and all other levels)
1 =             L2-
2 =             L3-
3 = least clear, but I know there is something like NTA, which is yet another facility having to do with caching data
# read / write
0 = read.
1 = write. I seem to remember there is something like an "intent to write" flag or "request for ownership",
both related to memory shared by multiple cores / threads, if I remember correctly, though I'm not entirely
sure why that would even be needed, as synchronization isn't guaranteed regardless of cache hits and
misses, unless it is enforced by locks, etc.

…a lot of uncertainty remains, even if I guessed some of it correctly.
The question also remains how to even test this — I think I’ll (at least) sleep on it before I give it a try with some tests.

Here is LLVM’s documentation for prefetch.

`address` is the address to be prefetched, `rw` is the specifier determining if the fetch should be for a read (0) or write (1), and `locality` is a temporal locality specifier ranging from (0) - no locality, to (3) - extremely local keep in cache. The `cache type` specifies whether the prefetch is performed on the data (1) or instruction (0) cache. The `rw`, `locality` and `cache type` arguments must be constant integers.
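Putting that documentation together with the `Val{L}` / `Val{R}` signature quoted earlier, a sketch of a parameterized wrapper could look as follows. Note that `locality` runs the other way around from the guess above: 3 means maximal temporal locality (keep in all caches), 0 means none. Same caveats as elsewhere in this thread: this assumes a 64-bit platform and an LLVM version that accepts this intrinsic spelling, and it is an illustration, not VectorizationBase's actual implementation.

```julia
# L = locality (0 = no temporal locality … 3 = keep in all cache levels),
# R = rw (0 = prefetch for a read, 1 = prefetch for a write), per the LLVM
# docs quoted above. The trailing `i32 1` selects the data cache.
@generated function prefetch(p::Ptr, ::Val{L}, ::Val{R}) where {L,R}
    @assert 0 <= L <= 3 && 0 <= R <= 1  # the intrinsic requires constant integers
    decls = "declare void @llvm.prefetch.p0i8(i8*, i32, i32, i32)"
    body = """
        %addr = inttoptr i64 %0 to i8*
        call void @llvm.prefetch.p0i8(i8* %addr, i32 $R, i32 $L, i32 1)
        ret void
        """
    quote
        Base.llvmcall(($decls, $body), Cvoid, Tuple{Ptr{Cvoid}}, Ptr{Cvoid}(p))
    end
end

a = zeros(Int, 1024)
prefetch(pointer(a), Val(3), Val(1))  # expect to write `a` soon; keep it close
```

Because `L` and `R` are type parameters, each combination compiles to a single prefetch instruction with constant arguments, which is exactly what the intrinsic demands.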

Feel free to make a PR adding documentation to VectorizationBase. =)

Note that LoopVectorization.jl actually inserts prefetches, so while not documented, lots of people are probably using it. (BLAS of course also uses it)
AccurateArithmetic.jl also uses it.

I suggest experimenting with locality as well as prefetch distance (i.e., how far ahead of actually using the memory you prefetch it).

In case an example is useful, here is the part of AccurateArithmetic.jl where memory prefetching is performed. In that context, `$(Prefetch*W)` is the prefetch distance: in a given iteration, the algorithm works with the data located at `px+offset`, but prefetches the data stored at `px+offset+$(Prefetch*W)`.
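For readers without the packages at hand, here is a self-contained sketch of the same idea: a plain summation loop that prefetches `dist` elements ahead of the element currently being consumed. The helper and the distance of 64 elements are illustrative assumptions to tune, not values taken from AccurateArithmetic.jl, and the usual caveats about the llvmcall spelling and 64-bit platforms apply.

```julia
# Hypothetical read-prefetch helper (an llvmcall sketch, same caveats as above).
@inline prefetch_read(p::Ptr) = Base.llvmcall(
    ("declare void @llvm.prefetch.p0i8(i8*, i32, i32, i32)",
     """
     %addr = inttoptr i64 %0 to i8*
     call void @llvm.prefetch.p0i8(i8* %addr, i32 0, i32 3, i32 1)
     ret void
     """),
    Cvoid, Tuple{Ptr{Cvoid}}, Ptr{Cvoid}(p))

# Sum `x`, prefetching the element `dist` iterations ahead of the current one.
function sum_prefetched(x::Vector{Float64}; dist::Int = 64)
    s = 0.0
    @inbounds for i in eachindex(x)
        i + dist <= length(x) && prefetch_read(pointer(x, i + dist))
        s += x[i]
    end
    return s
end
```

Whether this pays off depends on the access pattern: for a purely sequential sum the hardware prefetcher usually already does the job, so the technique matters most for strided or pointer-chasing access.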

Thanks everyone, this is great.