That difference is probably architecture-related.
Skylake-X (and Cascade Lake-X, which is almost identical) CPUs have AVX512, which doubles the size of the SIMD units, meaning each instruction can operate on 8 Float64s in parallel, instead of 4 Float64s as in CPUs with AVX(2).
Normally, how many clock cycles an instruction takes is independent of how wide it is.
However, on Skylake-X, the square root instruction actually takes twice as long for 8 Float64s as it does for 4 Float64s.
So basically everything except the square roots is now twice as fast. Thanks to Out of Order (OoO) execution, the CPU can finish the operations required for the next square root while the previous one is still finishing. Hence, with AVX512, the CPU spends almost 100% of its time calculating square roots, while everything else happens in parallel.
Without AVX512, the square roots are more or less equally fast, but everything else now takes twice as long, and so now takes longer than the square roots.
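One way to see that split directly is to time the non-sqrt work and the square roots separately. Below is a minimal sketch of that comparison; sq_distances_fm! is a made-up helper name, just distances_fm! (defined below) with the sqrt removed:
using StaticArrays, BenchmarkTools

# Same loop as distances_fm! below, but it stores the squared distance (no sqrt).
function sq_distances_fm!(d, x, X)
    @inbounds @fastmath for n ∈ axes(X,1)
        dᵢ = zero(eltype(d))
        y = X[n]
        for p = 1:3
            dᵢ += (x[p] - y[p])^2
        end
        d[n] = dᵢ
    end
    d
end

pg = [@SVector(rand(3)) for _ ∈ 1:3883];
d = similar(pg, Float64); r = rand(length(d));

@benchmark sq_distances_fm!($d, $pg[1], $pg)       # everything except the square roots
@benchmark $d .= Base.FastMath.sqrt_fast.($r)      # the square roots alone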
Also, testing on a tigerlake laptop (a newer architecture, with some differences from Skylake-X [the most striking one being that tigerlake has half the FMA performance…]), I see the same results:
julia> using StaticArrays, BenchmarkTools
julia> function distances_fm!(d, x, X)
    @inbounds @fastmath for n ∈ axes(X,1)
        dᵢ = zero(eltype(d))
        y = X[n]
        for p = 1:3
            dᵢ += (x[p] - y[p])^2
        end
        d[n] = sqrt(dᵢ)
    end
    d
end
distances_fm! (generic function with 1 method)
julia> pg = [@SVector(rand(3)) for _ ∈ 1:3883];
julia> d = similar(pg, Float64); r = rand(length(d));
julia> @benchmark distances_fm!($d, $pg[1], $pg)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.545 μs (0.00% GC)
median time: 2.602 μs (0.00% GC)
mean time: 2.590 μs (0.00% GC)
maximum time: 5.024 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark $d .= Base.FastMath.sqrt_fast.($r)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.559 μs (0.00% GC)
median time: 2.561 μs (0.00% GC)
mean time: 2.568 μs (0.00% GC)
maximum time: 3.647 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> versioninfo()
Julia Version 1.6.0-DEV.1324
Commit ae789ca1df (2020-10-24 01:55 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.0 (ORCJIT, tigerlake)
and I again see the same thing: the square roots by themselves take just as long as all the distances.
This of course makes it a little harder to optimize for your architecture.
If you want to speed up calculating the distances specifically, you can try this:
pgmat = copy(reshape(reinterpret(Float64, pg), (3, length(pg)))'); # length(pg) x 3 matrix
function distances_fm!(d, x, X::AbstractMatrix)
    @inbounds Base.Cartesian.@nexprs 3 p -> x_p = x[p];
    @inbounds @fastmath for n ∈ axes(X,1)
        dᵢ = zero(eltype(d))
        Base.Cartesian.@nexprs 3 p -> dᵢ += (x_p - X[n,p])^2
        d[n] = sqrt(dᵢ)
    end
    d
end
@benchmark distances_fm!($d, view($pgmat, 1, :), $pgmat)
Benchmark:
julia> @benchmark distances_fm!($d, view($pgmat, 1, :), $pgmat)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.617 μs (0.00% GC)
median time: 2.619 μs (0.00% GC)
mean time: 2.624 μs (0.00% GC)
maximum time: 4.654 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
Note that this requires a different memory layout than you had earlier: the vector of SVectors is basically a 3 x N matrix, which we've now transposed and copied into a new N x 3 array (the copy is necessary to actually change the memory layout; transpose/adjoint/' are lazy).
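To make that layout difference concrete, here is a small illustrative snippet (the name pg3xN is mine, not from the code above); strides reports how far apart consecutive elements are along each dimension:
pg3xN = reshape(reinterpret(Float64, pg), (3, length(pg)))  # lazy 3 x N view of the SVector data: memory order is x₁, y₁, z₁, x₂, y₂, z₂, ...
pgmat = copy(pg3xN')                                        # materialized N x 3 matrix: memory order is x₁, x₂, ..., then y₁, y₂, ..., then z₁, z₂, ...
pgmat == pg3xN'                                             # true: same values, different memory layout
strides(pgmat)                                              # (1, length(pg)): consecutive xs are adjacent; the ys start length(pg) elements later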
The code uses SIMD instructions, which act on multiple data elements in parallel, doing the same operation on each of them. That means it is calculating the distances in parallel.
If we call the elements of your SVectors x, y, and z, that requires loading many xs, ys, and zs in parallel.
Loading contiguous numbers in parallel is faster. If you're loading a bunch of numbers in parallel, you want them to be right next to each other.
That is, because you want to load
x_vec = [x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8]; # 8 per vector with AVX512
y_vec = [y_1, y_2, y_3, y_4, y_5, y_6, y_7, y_8]; # 8 per vector with AVX512
z_vec = [z_1, z_2, z_3, z_4, z_5, z_6, z_7, z_8]; # 8 per vector with AVX512
it helps if x_1, x_2, x_3, etc. are side by side in memory. Ditto for the ys and zs.
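With the N x 3 pgmat layout from the code above, those loads map onto contiguous slices of each column. This is ordinary indexing rather than literal SIMD intrinsics, but it sketches which elements each 8-wide load would touch:
x_vec = pgmat[1:8, 1]   # x_1 ... x_8: adjacent in memory at the start of column 1
y_vec = pgmat[1:8, 2]   # y_1 ... y_8: adjacent in memory at the start of column 2
z_vec = pgmat[1:8, 3]   # z_1 ... z_8: adjacent in memory at the start of column 3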
They aren't side by side in memory when you have a vector of SVectors. Instead, your memory looks like
[x_1, y_1, z_1, x_2, y_2, z_2, x_3,...]
It's not that big a deal because they're close to each other and the pattern is regular: LLVM will still load them contiguously, getting
v_1 = [x_1, y_1, z_1, x_2, y_2, z_2, x_3, y_3];
v_2 = [z_3, x_4, y_4, z_4, x_5, y_5, z_5, x_6];
v_3 = [y_6, z_6, x_7, y_7, z_7, x_8, y_8, z_8];
and then it shuffles their contents until it gets x_vec, y_vec, and z_vec, as above. The shuffling takes a few operations, but it is nowhere near as bad as if it had to gather them from more random or much more distant regions of memory.
On your CPU, where that part of the code (i.e., not the square roots) is the bottleneck, cutting out the shuffles may make the difference.
It may also be the case that this shuffling is way faster with AVX512; the actual instruction it uses, vpermt2pd, is AVX512-only. It could even be the case that without this instruction LLVM doesn't think SIMD is worth it and doesn't even try. Mind showing me the results of @code_native debuginfo=:none syntax=:intel distances_fm!(d, pg[1], pg)?
A couple other notes here:
- When it comes to changing memory layouts, you have to consider the rest of your code. Maybe the rest benefits from the vector of SVectors. If you have to pay a price to switch memory layouts, you'll have to consider whether it is worth it. One reason I went into the influence of memory layouts above is to help you understand the tradeoffs and make decisions about 3xN vs Nx3 in the rest of your code.
- What's up with Base.Cartesian.@nexprs? There's an issue where wrapping arrays in structs doesn't get handled well in loops, so I use it to load from the view before the loop and then reference those scalars inside of it; a sketch of what it expands to is below.
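Roughly, the two @nexprs calls in the matrix version of distances_fm! unroll into plain scalar expressions, something like this hand-expanded sketch (not the literal macro output):
# Base.Cartesian.@nexprs 3 p -> x_p = x[p] unrolls to:
x_1 = x[1]; x_2 = x[2]; x_3 = x[3]
# and inside the loop, Base.Cartesian.@nexprs 3 p -> dᵢ += (x_p - X[n,p])^2 unrolls to:
dᵢ += (x_1 - X[n,1])^2
dᵢ += (x_2 - X[n,2])^2
dᵢ += (x_3 - X[n,3])^2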
I'm curious whether, and by how much, this matrix version is faster on your computer.