It isn't just about memory
(This requires ArrayInterface >= 3.1.30 and LoopVectorization >= 0.12.69, which as of writing this are the latest versions [for things like properly propagating static size information of the reinterpreted array to LV].)
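The statically known length of the first axis is what LoopVectorization needs to specialize the loop. A quick (hypothetical) way to check that it is being propagated, assuming the ArrayInterface.known_size API of the 3.x series:
using ArrayInterface, StaticArrays
xs = [rand(SVector{3,Float64}) for _ in 1:1000]
Xs = reinterpret(reshape, Float64, xs)
ArrayInterface.known_size(Xs)  # expected: (3, nothing), i.e. the first axis length is known statically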
Benchmarks of SVector{3,Float64} vs SVector{4,Float64}:
julia> @benchmark forces!($x3,$f3,$d,$idxs)
BenchmarkTools.Trial: 10000 samples with 991 evaluations.
Range (min … max):  41.045 ns … 69.736 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     44.280 ns              ┊ GC (median):    0.00%
Time  (mean ± σ):   44.333 ns ±  0.524 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted]
 41 ns          Histogram: log(frequency) by time          45.6 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark forces_turbo!($x3,$f3,$d,$idxs)
BenchmarkTools.Trial: 10000 samples with 991 evaluations.
Range (min … max):  40.308 ns … 62.864 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     40.906 ns              ┊ GC (median):    0.00%
Time  (mean ± σ):   40.992 ns ±  0.532 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted]
 40.3 ns          Histogram: frequency by time            42.4 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark forces!($x4,$f4,$d,$idxs)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
Range (min … max):  28.011 ns … 48.468 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     28.552 ns              ┊ GC (median):    0.00%
Time  (mean ± σ):   28.554 ns ±  0.454 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted]
 28 ns            Histogram: frequency by time            29.8 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark forces_turbo!($x4,$f4,$d,$idxs)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
Range (min … max):  27.562 ns … 52.618 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     27.753 ns              ┊ GC (median):    0.00%
Time  (mean ± σ):   27.768 ns ±  0.405 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted]
 27.6 ns          Histogram: frequency by time             29 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Code
using StaticArrays, FastPow, BenchmarkTools, LoopVectorization
function forces_turbo!(x::Vector{SVector{N,T}}, f::Vector{SVector{N,T}}, d, idxs) where {N,T}
    # View the vectors of SVectors as N × length(x) matrices of T; the static
    # length of the first axis is propagated to LoopVectorization.
    X = reinterpret(reshape, T, x)
    F = reinterpret(reshape, T, f)
    forces_turbo!(X, F, d, idxs)
    return f
end
@inline function forces_turbo!(X, F, d, idxs)
    @turbo for id in axes(idxs, 1)
        i = idxs[id, 1]
        j = idxs[id, 2]
        for k in axes(X, 1)  # static trip count: the SVector length
            r = X[k, j] - X[k, i]
            dudr = -12 * (1 / d^7 - 1 / d^4) * r
            F[k, i] = F[k, i] + dudr
            F[k, j] = F[k, j] - dudr
        end
    end
end
function forces!(x, f, d, idxs)
    # Baseline: a plain loop over SVectors; @fastpow speeds up the integer powers.
    @inbounds for id in axes(idxs, 1)
        i = idxs[id, 1]
        j = idxs[id, 2]
        r = x[j] - x[i]
        @fastpow dudr = -12 * (1 / d^7 - 1 / d^4) * r
        f[i] = f[i] + dudr
        f[j] = f[j] - dudr
    end
    return f
end
x3 = [ rand(SVector{3,Float64}) for _ in 1:1000 ];
f3 = [ zeros(SVector{3,Float64}) for _ in 1:1000 ];
x4 = [ rand(SVector{4,Float64}) for _ in 1:1000 ];
f4 = [ zeros(SVector{4,Float64}) for _ in 1:1000 ];
d = rand();
idxs = rand(1:1000,20,2);
f3_2 = copy(f3); f4_2 = copy(f4);
forces!(x3,f3,d,idxs) ≈ forces_turbo!(x3,f3_2,d,idxs)  # check that both versions agree
forces!(x4,f4,d,idxs) ≈ forces_turbo!(x4,f4_2,d,idxs)
@benchmark forces!($x3,$f3,$d,$idxs)
@benchmark forces_turbo!($x3,$f3,$d,$idxs)
@benchmark forces!($x4,$f4,$d,$idxs)
@benchmark forces_turbo!($x4,$f4,$d,$idxs)
Here is the asm of the @turbo loop with SVector{3,Float64}.
Here is the asm of the @turbo loop with SVector{4,Float64}.
The above benchmarks were done on a Cascadelake CPU, so you can click the Cascadelake box (under Microarchitecture), make sure uiCA is checked among the tools (it is the most accurate), and then click Run! to get an estimated throughput (the average number of clock cycles per iteration when completing many iterations, under certain memory assumptions) along with an analysis of the assembly and port use.
You can also use @code_native syntax=:intel debuginfo=:none to take a look at the assembly of the loops for your own architecture.
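For example, to dump the kernel that actually contains the @turbo loop (X3 and F3 are hypothetical names for the reinterpreted views the wrapper creates):
X3 = reinterpret(reshape, Float64, x3)
F3 = reinterpret(reshape, Float64, f3)
@code_native syntax=:intel debuginfo=:none forces_turbo!(X3, F3, d, idxs)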
My original point here was to say that LoopVectorization can SIMD these 24-byte (192-bit) vectors just fine. However, the estimated throughput is worse: all the masked operations require 2 ports instead of 1. Still, the performance difference we see here is much larger than the estimated 5 vs 4.28 cycles per iteration for this architecture.
I suspect that difference comes from many of the loads and stores in the SVector{3,Float64} case not being aligned, crossing 64-byte boundaries; loads and stores that cross such a boundary count double.
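A back-of-the-envelope check (my own sketch; it assumes the base address is 64-byte aligned, which the allocator does not guarantee here):
# Does an element of size sz starting at offset straddle a 64-byte line?
crosses(offset, sz) = offset ÷ 64 != (offset + sz - 1) ÷ 64
count(i -> crosses(24 * (i - 1), 24), 1:1000)  # 250: 1 in 4 SVector{3,Float64}s cross a line
count(i -> crosses(32 * (i - 1), 32), 1:1000)  # 0: SVector{4,Float64}s never cross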
Thus, the memory accesses are much faster in the SVector{4,Float64} case than in the SVector{3,Float64} case.
Note that in both cases, memory accesses dominated the cost of the loop. E.g., in the SVector{4,Float64} case, the two ports under the heaviest load were ports 2 and 3, both of which handled only memory. In the SVector{3,Float64} case, port 5 was hit hardest, and it too was restricted entirely to memory-related operations.
Note that uiCA assumes all memory is in the L1 cache; by saying this is memory dominated, I mean that even in this best-case scenario the CPU's execution resources are mostly spent rearranging memory (which is pretty normal).
EDIT: You could also speed it up slightly by first calculating dinv = inv(d), and then using dinv^4 and dinv^7.
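A sketch of what that tweak could look like (forces_turbo2! is a hypothetical name; only the power computation changes):
@inline function forces_turbo2!(X, F, d, idxs)
    dinv = inv(d)   # one division, hoisted out of the loop
    dinv4 = dinv^4
    dinv7 = dinv^7
    @turbo for id in axes(idxs, 1)
        i = idxs[id, 1]
        j = idxs[id, 2]
        for k in axes(X, 1)
            r = X[k, j] - X[k, i]
            dudr = -12 * (dinv7 - dinv4) * r  # no divisions in the inner loop
            F[k, i] = F[k, i] + dudr
            F[k, j] = F[k, j] - dudr
        end
    end
end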