I have just recently learned about non-temporal memory reading and writing (reading this), and I’m trying to experiment with it. I already have a basic merge-sort algorithm implemented using SIMD.jl. How could I replace the vload and vstore functions with non-temporal versions using intrinsics directly? Or even better, would it be simple to add this as a feature in that library? I might try to do it if it seems like a good idea.
Here is my first experiment with SIMD.jl; comments are appreciated: SIMD based merge sort in Julia · GitHub
I tried to pass a nontemporal argument to load and store using llvmcall, following the code from SIMD.jl as a reference. The function works, except I never get a non-temporal instruction in the machine code, and even in the LLVM code the flag vanishes. What could be the reason for that?
function vstorent(x::Vec{4,Float64}, aa::Ptr{Float64})
    Base.llvmcall("
        %ptr = inttoptr i64 %1 to <4 x double>*
        store <4 x double> %0, <4 x double>* %ptr, align 8, !nontemporal !{ i32 1 }
        ret void"
        , Cvoid, Tuple{NTuple{4,VecElement{Float64}}, Ptr{Float64}}, x.elts, aa)
end
qq = randn(100);
@code_llvm vstorent(Vec((100.0,100.0,100.0,100.0)), pointer(qq, 1))
; @ REPL[17]:2 within `vstorent'
define void @julia_vstorent_12380({ <4 x double> } addrspace(11)* nocapture nonnull readonly dereferenceable(32), i64) {
top:
; ┌ @ sysimg.jl:18 within `getproperty'
%2 = getelementptr inbounds { <4 x double> }, { <4 x double> } addrspace(11)* %0, i64 0, i32 0
; └
%3 = load <4 x double>, <4 x double> addrspace(11)* %2, align 16
%ptr.i = inttoptr i64 %1 to <4 x double>*
store <4 x double> %3, <4 x double>* %ptr.i, align 8
ret void
}
This ended up working. Either rebuilding Julia, updating LLVM, or just using the -C flag did it. I even created this PR: Added support for non-temporal memory writing by nlw0 · Pull Request #46 · eschnett/SIMD.jl · GitHub
Thanks a lot for posting this! I’ve been wanting to try it for some time.
I tried SIMD.jl master and your benchmark and I could reproduce the speedup. But it looks like the @code_llvm output still hides the !nontemporal metadata? Did you figure out why? I’m wondering if it could be a bug in the IR printer or something.
Yes, I noticed the same thing: it shows up in @code_native but not in @code_llvm! I have no clue; I didn’t get into how any of that works…
I posted an issue just in case it’s a bug: https://github.com/JuliaLang/julia/issues/31056
Oops. The answer was to use @code_llvm raw=true ...: https://github.com/JuliaLang/julia/issues/31056#issuecomment-463083357
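For anyone landing here later, a minimal sketch of comparing the two views programmatically. As far as I understand, the default code_llvm output strips metadata (which would include attachments like !nontemporal), while raw=true prints the full IR. The add_one function and the ir_string helper are hypothetical stand-ins for illustration, not part of SIMD.jl; swap in your own function and argument types.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical stand-in; replace with the function you want to inspect.
add_one(x::Float64) = x + 1.0

# Hypothetical helper: capture the LLVM IR as a String instead of printing it.
function ir_string(f, types; raw::Bool)
    io = IOBuffer()
    code_llvm(io, f, types; raw=raw)
    return String(take!(io))
end

stripped_ir = ir_string(add_one, (Float64,); raw=false)  # default view, metadata removed
raw_ir      = ir_string(add_one, (Float64,); raw=true)   # full IR, metadata kept

println("stripped view: ", length(stripped_ir), " characters")
println("raw view:      ", length(raw_ir), " characters")
```

With a function that performs a non-temporal store, something like occursin("nontemporal", raw_ir) should then be a quick way to check that the flag survived, without reading the whole dump.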