Nice trick of manually unrolling by x16 to generate efficient code (on AVX2)!
FWIW, I was interested in UnitRange vs StepRange because I have a library called LoopVectorization.jl, which vectorizes, and optionally unrolls, loops.
The idea is that it’s a blunt, brute-force instrument.
(LoopVectorization also depends on the unregistered VectorizationBase, SIMDPirates, and the SIMDPirates fork of SLEEF.jl – I’ll eventually make a new repo called SLEEFPirates to avoid having to git fetch origin && git checkout -b SIMDPirates remotes/origin/SIMDPirates
).
Sometimes, no matter how much you try and argue with the autovectorizer, it may refuse to cooperate. Especially when pointers are involved. For example:
function axpy!(z, a, x, y)
@inbounds @simd ivdep for i ∈ eachindex(z, x, y)
z[i] = a * x[i] + y[i]
end
end
using StaticArrays, BenchmarkTools
function randmvector(::Val{N}, T = Float64) where N
# Compiles quickly, regardless of size of MVector
# unlike @MVector rand(N)
mv = MVector{N,T}(undef)
@inbounds for i ∈ 1:N
mv[i] = rand()
end
mv
end
x = randmvector(Val(512));
y = randmvector(Val(512));
z = MVector{512,Float64}(undef);
@code_native axpy!(z, 2.0, x, y)
This yields:
julia> @code_native axpy!(z, 2.0, x, y)
.text
; ┌ @ REPL[1]:2 within `axpy!'
movq $-4096, %rax # imm = 0xF000
nopw (%rax,%rax)
; │┌ @ simdloop.jl:73 within `macro expansion' @ REPL[1]:3
; ││┌ @ float.jl:399 within `*'
L16:
vmulsd 4096(%rsi,%rax), %xmm0, %xmm1
; ││└
; ││ @ simdloop.jl:73 within `macro expansion' @ float.jl:395
vaddsd 4096(%rdx,%rax), %xmm1, %xmm1
; ││ @ simdloop.jl:73 within `macro expansion' @ REPL[1]:3
; ││┌ @ MArray.jl:130 within `setindex!'
; │││┌ @ gcutils.jl:87 within `macro expansion'
; ││││┌ @ pointer.jl:118 within `unsafe_store!'
vmovsd %xmm1, 4096(%rdi,%rax)
; │└└└└
; │┌ @ int.jl:49 within `macro expansion'
addq $8, %rax
; └└
; ┌ @ simdloop.jl:71 within `axpy!'
jne L16
; └
; ┌ @ REPL[1]:2 within `axpy!'
retq
nopw %cs:(%rax,%rax)
; └
The loop is just:
L16:
vmulsd 4096(%rsi,%rax), %xmm0, %xmm1
vaddsd 4096(%rdx,%rax), %xmm1, %xmm1
vmovsd %xmm1, 4096(%rdi,%rax)
addq $8, %rax
jne L16
Note that vmulsd
and vaddsd
multiply and add single doubles (hence, sd
instead of pd
suffix).
Yikes!
It’s naturally very slow:
julia> @btime axpy!($z, 2.0, $x, $y)
228.304 ns (0 allocations: 0 bytes)
While
using LoopVectorization
function axpy2!(z, a, x, y)
@vectorize for i ∈ eachindex(z, x, y)
z[i] = a * x[i] + y[i]
end
end
@code_native axpy2!(z, 2.0, x, y)
Yields:
L16:
vmovupd 4096(%rsi,%rax), %zmm1
vfmadd213pd 4096(%rdx,%rax), %zmm0, %zmm1
vmovupd %zmm1, 4096(%rdi,%rax)
addq $64, %rax
jne L16
I also did add an unroll
argument:
function axpy3!(z, a, x, y)
@vectorize 2 for i ∈ eachindex(z, x, y)
z[i] = a * x[i] + y[i]
end
end
function axpy4!(z, a, x, y)
@vectorize 4 for i ∈ eachindex(z, x, y)
z[i] = a * x[i] + y[i]
end
end
@code_native axpy3!(z, 2.0, x, y)
@code_native axpy4!(z, 2.0, x, y)
Note that it currently gets rounded down to the nearest power of 2.
axpy3
looks like expected:
L16:
vmovupd 4096(%rsi,%rax), %zmm1
vmovupd 4160(%rsi,%rax), %zmm2
vfmadd213pd 4096(%rdx,%rax), %zmm0, %zmm1
vfmadd213pd 4160(%rdx,%rax), %zmm0, %zmm2
vmovupd %zmm2, 4160(%rdi,%rax)
vmovupd %zmm1, 4096(%rdi,%rax)
addq $128, %rax
jne L16
while for axpy4, we unrolled the loop by enough (only 16 iterations left) that llvm actually decided to completely unroll the loop.
Results:
julia> @btime axpy2!($z, 2.0, $x, $y)
23.885 ns (0 allocations: 0 bytes)
julia> @btime axpy3!($z, 2.0, $x, $y)
23.793 ns (0 allocations: 0 bytes)
julia> @btime axpy4!($z, 2.0, $x, $y)
24.066 ns (0 allocations: 0 bytes)
Much better!
Of course, my actual motivation was more along the lines of:
using SLEEF
function special!(z, a, x, y)
@inbounds for i ∈ eachindex(z, x, y)
w = a * x[i] + y[i]
s, c = sincos(w)
z[i] = exp(s + log(c*c))
end
end
function vectorized_special!(z, a, x, y)
@vectorize for i ∈ eachindex(z, x, y)
w = a * x[i] + y[i]
s, c = sincos(w)
z[i] = exp(s + log(c*c))
end
end
@btime special!($z, 2.0, $x, $y)
@btime vectorized_special!($z, 2.0, $x, $y)
Yielding an almost 8x speedup:
julia> @btime special!($z, 2.0, $x, $y)
15.081 μs (0 allocations: 0 bytes)
julia> @btime vectorized_special!($z, 2.0, $x, $y)
1.955 μs (0 allocations: 0 bytes)