I don’t think that’s the case, at least on recent x64 hardware.
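Loads and stores to aligned addresses are faster, but the aligned move instructions themselves (e.g. vmovapd) aren’t any faster than the unaligned ones (e.g. vmovupd) when the address is in fact aligned. The only difference is that the aligned variant crashes if the address is unaligned.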
So the benefit of promising alignment isn’t performance. The benefit is free runtime checks (free from a performance perspective) that the memory really is aligned. You’ll be notified by a segfault if you’re wrong, rather than silently having worse performance.
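If you would rather check alignment up front than find out via a crash, something like the following should work. This is only a minimal sketch: is_aligned is a made-up helper name, not something from Base or VectorizationBase, and whether a given array turns out to be 64-byte aligned depends on the allocator.

```julia
# Hypothetical helper (not part of Base or VectorizationBase):
# true if the address held by `p` is a multiple of `nbytes`.
is_aligned(p::Ptr, nbytes::Integer) = reinterpret(UInt, p) % UInt(nbytes) == 0

x = rand(256)
is_aligned(pointer(x), 64)     # result depends on how the array was allocated
is_aligned(pointer(x, 2), 64)  # address of x[2], 8 bytes past the start of x
```

Since the two addresses differ by 8 bytes, at most one of the two calls can return true.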
EDIT:
Some compilers, like gcc, will often generate alignment checks plus some code to align if unaligned in front of loops. So std::assume_aligned could let the compiler skip these checks, which would have a performance benefit.
Also:
using VectorizationBase: assume
function mydot_aligned(x,y)
    s = zero(promote_type(eltype(x),eltype(y)))
    assume((reinterpret(UInt, pointer(x)) % (64 % UInt)) == zero(UInt))
    assume((reinterpret(UInt, pointer(y)) % (64 % UInt)) == zero(UInt))
    @inbounds @simd for i in eachindex(x,y)
        s += x[i]*y[i]
    end
    s
end
produces this SIMD loop (@code_native):
L176:
vmovapd zmm4, zmmword ptr [rax + 8*rsi]
vmovapd zmm5, zmmword ptr [rax + 8*rsi + 64]
vmovapd zmm6, zmmword ptr [rax + 8*rsi + 128]
vmovapd zmm7, zmmword ptr [rax + 8*rsi + 192]
vfmadd231pd zmm0, zmm4, zmmword ptr [rcx + 8*rsi] # zmm0 = (zmm4 * mem) + zmm0
vfmadd231pd zmm1, zmm5, zmmword ptr [rcx + 8*rsi + 64] # zmm1 = (zmm5 * mem) + zmm1
vfmadd231pd zmm2, zmm6, zmmword ptr [rcx + 8*rsi + 128] # zmm2 = (zmm6 * mem) + zmm2
vfmadd231pd zmm3, zmm7, zmmword ptr [rcx + 8*rsi + 192] # zmm3 = (zmm7 * mem) + zmm3
add rsi, 32
cmp rdx, rsi
jne L176
Notice the vmovapd instructions instead of vmovupd.
So this does work to tell LLVM about alignment.
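Rather than reading the assembly by eye, the same comparison can be done programmatically. A rough sketch, where count_instr is a hypothetical helper name and the counts depend entirely on your CPU and Julia/LLVM version (on non-x86 hardware both will simply be zero):

```julia
using InteractiveUtils  # provides code_native

# Hypothetical helper: count textual occurrences of an instruction
# mnemonic in the native code generated for `f` with argument `types`.
function count_instr(f, types, mnemonic::AbstractString)
    asm = sprint(io -> code_native(io, f, types; debuginfo=:none))
    return count(mnemonic, asm)
end

# Demonstrated on a Base function; pass mydot_aligned (with two
# Vector{Float64} argument types) to check the function from this post.
argtypes = (Vector{Float64},)
count_instr(sum, argtypes, "vmovapd")
count_instr(sum, argtypes, "vmovupd")
```

This is a plain substring count over the printed assembly, so treat it as a quick sanity check rather than a real disassembly analysis.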
EDIT:
Maybe I’m wrong:
julia> x = rand(256);
julia> y = rand(256);
julia> @btime mydot($x,$y)
10.032 ns (0 allocations: 0 bytes)
67.11501240811893
julia> @btime mydot_aligned($x,$y)
8.534 ns (0 allocations: 0 bytes)
67.11501240811893
julia> @btime mydot($x,$y)
10.031 ns (0 allocations: 0 bytes)
67.11501240811893
julia> @btime mydot_aligned($x,$y)
8.530 ns (0 allocations: 0 bytes)
67.11501240811893
mydot is the same, except I commented out the assume calls.
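For reference, going by that description the baseline would look roughly like this (reconstructed from mydot_aligned above with the two assume lines dropped, so it needs no VectorizationBase import):

```julia
# Baseline dot product: identical loop, but no alignment promises to LLVM.
function mydot(x, y)
    s = zero(promote_type(eltype(x), eltype(y)))
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    s
end
```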
EDIT:
restarted Julia:
julia> x = rand(256);
julia> y = rand(256);
julia> @btime mydot($x,$y)
11.590 ns (0 allocations: 0 bytes)
63.81066585474556
julia> @btime mydot_aligned($x,$y)
11.885 ns (0 allocations: 0 bytes)
63.81066585474556
julia> @btime mydot($x,$y)
11.589 ns (0 allocations: 0 bytes)
63.81066585474556
julia> @btime mydot_aligned($x,$y)
11.960 ns (0 allocations: 0 bytes)
63.81066585474556
Was probably just noise. Sometimes functions are just randomly faster or slower for no discernible (by me) reason in a manner that is consistent within a Julia session, but not between Julia sessions/recompilations.