Well, I knew the SIMD.jl version wasn't unrolled because your code wasn't unrolling. And I knew that LLVM always unrolls simple loops like these 4x on top of vectorization, and that LoopVectorization will do the same. But you can check with one of the `@code_` reflection macros or with `Cthulhu.@descend`.
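For example, something like this (a sketch: `simd_vec4` is the SIMD.jl function discussed below, and the array size is an assumption):

using InteractiveUtils  # @code_native / @code_llvm; already loaded in the REPL
r = rand(UInt64, 8_000); mask = rand(UInt64)     # assumed test inputs
@code_llvm debuginfo=:none simd_vec4(r, mask)    # LLVM IR (more verbose)
@code_native debuginfo=:none simd_vec4(r, mask)  # assembly, like what is shown below
# or, interactively:
# using Cthulhu; Cthulhu.@descend simd_vec4(r, mask)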
I actually noticed another problem with the SIMD.jl code.
I can show the LLVM IR instead if you'd prefer, but I'm showing the assembly because it is much less verbose; I added comments on what each line is doing:
L64:
mov rdx, qword ptr [rbx] # load the pointer to r
vpand ymm1, ymm0, ymmword ptr [rdx + 8*rcx] # ymm1 = load from `r` and `&`
vmovdqu ymmword ptr [rdx + 8*rcx], ymm1 # store to r
add rcx, 4 # add 4 to loop counter
cmp rax, rcx # compare with end
jne L64 # conditionally jump to L64
Here, you can see:
- that it is unrolled just once (but it is SIMD), and
- that it is loading the pointer to `r` from memory on each loop iteration.

Versus `lv_turbo`'s loop:
L48:
vpandq zmm1, zmm0, zmmword ptr [rdx + 8*rcx] # zmm1 = load from `r` and `&`
vpandq zmm2, zmm0, zmmword ptr [rdx + 8*rcx + 64] # zmm2 = load from `r` and `&`
vpandq zmm3, zmm0, zmmword ptr [rdx + 8*rcx + 128] # zmm3 = load from `r` and `&`
vpandq zmm4, zmm0, zmmword ptr [rdx + 8*rcx + 192] # zmm4 = load from `r` and `&`
vmovdqu64 zmmword ptr [rdx + 8*rcx], zmm1 # store
vmovdqu64 zmmword ptr [rdx + 8*rcx + 64], zmm2 # store
vmovdqu64 zmmword ptr [rdx + 8*rcx + 128], zmm3 # store
vmovdqu64 zmmword ptr [rdx + 8*rcx + 192], zmm4 # store
add rcx, 32 # add 32 (width = 8, unrolled by 4)
cmp rdi, rcx # compare with end
jne L48 # conditionally jump to L48
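For context, here is a minimal sketch of what the two functions being compared presumably look like; their real definitions come from earlier in the thread, so treat these bodies as assumptions:

using SIMD, LoopVectorization

# Assumed SIMD.jl version (simd_vec4 would be the same with 8 replaced by 4).
# Assumes length(r) is a multiple of the vector width.
function simd_vec8(r::Vector{UInt64}, mask::UInt64)
    simd_mask = SIMD.Vec{8,UInt64}(mask)  # broadcast the scalar mask into a vector
    lane = SIMD.VecRange{8}(0)            # addresses 8 consecutive elements at once
    @inbounds for i in 1:8:length(r)
        r[i + lane] = r[i + lane] & simd_mask
    end
    return r
end

# Assumed LoopVectorization.jl version, written with an explicit `&` because
# `&=` support was only added later (see the EDIT at the end of this post).
function lv_turbo(r::Vector{UInt64}, mask::UInt64)
    @turbo for i in eachindex(r)
        r[i] = r[i] & mask
    end
    return r
end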
We can look at the effect using LinuxPerf:
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
foreachf(simd_vec4, 1_000_000, r, mask)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               4.49e+09  100.0%  #   4.7 cycles per ns
└ task-clock               9.66e+08  100.0%  # 965.9 ms
┌ instructions             1.26e+10  100.0%  #   2.8 insns per cycle
│ branch-instructions      2.10e+09  100.0%  #  16.7% of instructions
└ branch-misses            1.96e+06  100.0%  #   0.1% of branch instructions
┌ L1-dcache-load-misses    1.07e+09  100.0%  #  25.6% of dcache loads
│ L1-dcache-loads          4.18e+09  100.0%
│ cache-misses             6.17e+04  100.0%  #  26.4% of cache references
└ cache-references         2.34e+05  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
foreachf(lv_turbo, 1_000_000, r, mask)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               3.57e+09  100.0%  #   4.6 cycles per ns
└ task-clock               7.81e+08  100.0%  # 781.2 ms
┌ instructions             3.29e+09  100.0%  #   0.9 insns per cycle
│ branch-instructions      3.49e+08  100.0%  #  10.6% of instructions
└ branch-misses            1.05e+06  100.0%  #   0.3% of branch instructions
┌ L1-dcache-load-misses    1.07e+09  100.0%  #  90.5% of dcache loads
│ L1-dcache-loads          1.18e+09  100.0%
│ cache-misses             7.07e+04  100.0%  #  25.8% of cache references
└ cache-references         2.74e+05  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
foreachf(simd_vec8, 1_000_000, r, mask)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               3.68e+09  100.0%  #   4.6 cycles per ns
└ task-clock               8.07e+08  100.0%  # 806.9 ms
┌ instructions             6.61e+09  100.0%  #   1.8 insns per cycle
│ branch-instructions      1.11e+09  100.0%  #  16.8% of instructions
└ branch-misses            1.95e+06  100.0%  #   0.2% of branch instructions
┌ L1-dcache-load-misses    1.07e+09  100.0%  #  49.1% of dcache loads
│ L1-dcache-loads          2.19e+09  100.0%
│ cache-misses             6.65e+04  100.0%  #  25.2% of cache references
└ cache-references         2.64e+05  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
`simd_vec8` was modified to use vectors of length 8, as my laptop supports that. The assembly is otherwise the same as `simd_vec4`'s (just replace the 4 with an 8, and ymm with zmm).
Note how, on my laptop, `simd_vec8` was nearly as fast as `lv_turbo`: a million iterations took around 800 ms for both. However, `lv_turbo` required half as many instructions to execute; the CPU was running at just 0.9 instructions/cycle instead of 1.8, i.e. it was memory bound. Making the arrays smaller shows the advantage: we get a much larger than 10x speedup by decreasing the size of `r` 10-fold:
julia> rshort = rand(UInt64, 800);
julia> @btime simd_vec8($rshort, $mask);
42.574 ns (0 allocations: 0 bytes)
julia> @btime lv_turbo($rshort, $mask);
24.629 ns (0 allocations: 0 bytes)
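For completeness: `foreachf` isn't defined in this post; presumably it just calls the function repeatedly so the hardware counters accumulate over many iterations, something along these lines (an assumption):

using LinuxPerf: @pstats      # used in the @pstats blocks above
using BenchmarkTools: @btime  # used in the @btime calls

# Assumed helper: apply f to args n times and discard the results.
function foreachf(f::F, n::Int, args::Vararg{Any,A}) where {F,A}
    for _ in 1:n
        f(args...)
    end
    return nothing
end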
A function you can use to experiment:
# U = unroll factor, V = SIMD vector width. Assumes length(r) is a multiple of U*V.
function simd_vec_uv(r::Vector{UInt64}, mask::UInt64, ::Val{U}, ::Val{V}) where {U,V}
    simd_mask = SIMD.Vec{V,UInt64}(mask)  # broadcast the scalar mask to a length-V vector
    lane = SIMD.VecRange{V}(0)            # addresses V consecutive elements at once
    for i in 1:U*V:length(r)
        # load and mask U vectors of V elements each
        masked = ntuple(Val(U)) do u
            Base.@_inline_meta
            @inbounds r[i+V*(u-1)+lane] & simd_mask
        end
        # store them back
        ntuple(Val(U)) do u
            Base.@_inline_meta
            @inbounds r[i+V*(u-1)+lane] = masked[u]
        end
    end
    r
end
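Before looking at its assembly, here is a quick way to sanity-check it against the scalar operation (a sketch; the test array length is chosen as a multiple of U*V = 32, which the loop assumes):

rtest = rand(UInt64, 32 * 25)   # hypothetical test data, 800 elements
expected = rtest .& mask        # scalar reference result
simd_vec_uv(rtest, mask, Val(4), Val(8))
@assert rtest == expected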
Note that this function produces a lot of those annoying "reload the pointer to `r`" loads:
L64:
mov rdx, qword ptr [rbx]
vpandq zmm1, zmm0, zmmword ptr [rdx + 8*rcx]
vpandq zmm2, zmm0, zmmword ptr [rdx + 8*rcx + 64]
vpandq zmm3, zmm0, zmmword ptr [rdx + 8*rcx + 128]
vpandq zmm4, zmm0, zmmword ptr [rdx + 8*rcx + 192]
vmovdqu64 zmmword ptr [rdx + 8*rcx], zmm1
mov rdx, qword ptr [rbx]
vmovdqu64 zmmword ptr [rdx + 8*rcx + 64], zmm2
mov rdx, qword ptr [rbx]
vmovdqu64 zmmword ptr [rdx + 8*rcx + 128], zmm3
mov rdx, qword ptr [rbx]
vmovdqu64 zmmword ptr [rdx + 8*rcx + 192], zmm4
add rcx, 32
cmp rax, rcx
jne L64
I think that's something you could file an issue about over at SIMD.jl. It doesn't seem to make that much of a difference to performance, though. For the vectors of length 800, for example, I get about 28 ns for `simd_vec_uv(r, mask, Val(4), Val(8))`, which is the same thing `lv_turbo` is doing on my computer. `simd_vec_uv(r, mask, Val(4), Val(4))` would probably be correct for yours.
EDIT: Also, adding `&=` support: https://github.com/JuliaSIMD/LoopVectorization.jl/commit/5f62713aef127ad474a1cd5faa51980b3503ddec
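With that change, the masking loop can presumably be written with the compound assignment directly inside `@turbo` (a sketch, assuming the commit does what its message says; `lv_turbo_compound` is a hypothetical name):

using LoopVectorization

function lv_turbo_compound(r::Vector{UInt64}, mask::UInt64)
    @turbo for i in eachindex(r)
        r[i] &= mask   # uses the newly added &= support
    end
    return r
end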