Non-temporal memory IO

I recently learned about non-temporal memory reads and writes (reading this), and I’m trying to experiment with them. I already have a basic merge sort implemented using SIMD.jl. How could I replace the vload and vstore functions with non-temporal versions that use intrinsics directly? Or, even better, would it be simple to add this as a feature to that library? I might try to do it if it seems like a good idea.

Here is my first experiment with SIMD.jl; comments are appreciated: SIMD based merge sort in Julia · GitHub

I tried to pass a nontemporal argument to load and store using llvmcall, following the code from SIMD.jl as a reference. The function works, but I never get a non-temporal instruction in the machine code, and the flag vanishes even in the LLVM code. What could be the reason for that?

function vstorent(x::Vec{4,Float64}, aa::Ptr{Float64})
    Base.llvmcall("
            %ptr = inttoptr i64 %1 to <4 x double>*
            store <4 x double> %0, <4 x double>* %ptr, align 8, !nontemporal !{ i32 1 }
            ret void"
                  , Cvoid, Tuple{NTuple{4,VecElement{Float64}}, Ptr{Float64}}, x.elts, aa)
end
qq = randn(100);
@code_llvm vstorent(Vec((100.0,100.0,100.0,100.0)), pointer(qq, 1))

;  @ REPL[17]:2 within `vstorent'
define void @julia_vstorent_12380({ <4 x double> } addrspace(11)* nocapture nonnull readonly dereferenceable(32), i64) {
top:
; ┌ @ sysimg.jl:18 within `getproperty'
   %2 = getelementptr inbounds { <4 x double> }, { <4 x double> } addrspace(11)* %0, i64 0, i32 0
; └
  %3 = load <4 x double>, <4 x double> addrspace(11)* %2, align 16
  %ptr.i = inttoptr i64 %1 to <4 x double>*
  store <4 x double> %3, <4 x double>* %ptr.i, align 8
  ret void
}


This ended up working. Either rebuilding Julia, updating LLVM, or just using the -C flag did it. I even created this PR: Added support for non-temporal memory writing by nlw0 · Pull Request #46 · eschnett/SIMD.jl · GitHub

Thanks a lot for posting this! I’ve been wanting to try it for some time.

I tried SIMD.jl master with your benchmark and could reproduce the speedup. But it looks like the @code_llvm output still hides the !nontemporal metadata. Did you figure out why? I’m wondering if it could be a bug in the IR printer or something.

Yes, I noticed the same thing: it shows up in @code_native but not in @code_llvm! I have no clue why; I haven’t looked into how any of that works…

I posted an issue just in case it’s a bug: https://github.com/JuliaLang/julia/issues/31056

Oops. The answer was to use @code_llvm raw=true ...: https://github.com/JuliaLang/julia/issues/31056#issuecomment-463083357
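
For anyone who would rather check this from a script than by eye, here is a minimal sketch of the same idea. It assumes only the standard InteractiveUtils module; the helper name has_nontemporal and the baseline function plain_fill! are made up for illustration and are not part of SIMD.jl or Julia. It prints the IR with raw=true into a string and searches it for the !nontemporal annotation.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical helper: true if the IR printed for f with the given argument
# types contains a !nontemporal annotation. raw=true asks code_llvm to print
# the IR without stripping metadata first.
function has_nontemporal(f, types)
    ir = sprint(io -> code_llvm(io, f, types; raw=true))
    return occursin("!nontemporal", ir)
end

# An ordinary store loop, used only as a baseline with regular stores.
function plain_fill!(a::Vector{Float64}, x::Float64)
    @inbounds for i in eachindex(a)
        a[i] = x
    end
    return a
end

# Baseline check: a regular store loop should carry no such annotation.
println(has_nontemporal(plain_fill!, (Vector{Float64}, Float64)))
```

With SIMD.jl loaded, the same helper could presumably be pointed at the function from this thread, e.g. has_nontemporal(vstorent, (Vec{4,Float64}, Ptr{Float64})), to confirm that the metadata survives once the printer no longer strips it. I have not run that variant here, so treat it as an assumption.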
