Interesting. Mind showing the @code_native debuginfo=:none syntax=:intel of both versions, or at least the main loop?

E.g., with the BitVector (the function test_avx! is defined below):

@code_native debuginfo=:none syntax=:intel test_avx!(BitVector(undef, length(a)), a, b)
L64:
vmovupd zmm0, zmmword ptr [r9 + 8*rdx]
vcmpltpd k0, zmm0, zmmword ptr [r10 + 8*rdx]
kmovb byte ptr [rdi], k0
add rdx, 8
inc rdi
cmp rdx, rsi
jl L64
and the Bool vector:

@code_native debuginfo=:none syntax=:intel test_avx!(Vector{Bool}(undef, length(a)), a, b)
L64:
vmovupd zmm0, zmmword ptr [r9 + 8*rdx]
vcmpltpd k0, zmm0, zmmword ptr [rdi + 8*rdx]
vpmovm2w xmm0, k0
vpsrlw xmm0, xmm0, 15
vpmovwb qword ptr [r8 + rdx], xmm0
add rdx, 8
cmp rdx, rsi
jl L64
With AVX512, the comparison (vcmpltpd) generates bitmasks directly. So the difference between the two versions is that with bitmasks, it can store those bits directly (kmovb). With Bools, it needs two extra instructions (vpmovm2w and vpsrlw) to turn the mask bits into a vector of bools, which vpmovwb then narrows to bytes and stores.
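To make the difference concrete, here is a rough scalar sketch of what one 8-element iteration amounts to in each case. The helper names and the Vector{UInt8} stand-in for the packed storage are purely illustrative (an actual BitVector packs its bits into UInt64 chunks, and this is not what LoopVectorization emits):

# BitVector-style path: pack 8 comparison results into one byte, then do a single 1-byte store (like kmovb).
function store_bits!(chunks::Vector{UInt8}, a, b, i)
    byte = 0x00
    for j in 0:7
        byte |= UInt8(a[i+j] < b[i+j]) << j
    end
    chunks[((i - 1) >> 3) + 1] = byte   # one byte holds 8 results
    chunks
end

# Vector{Bool}-style path: each result becomes its own 0x00/0x01 byte;
# the SIMD version first expands the mask (vpmovm2w, vpsrlw), then stores 8 bytes at once (vpmovwb).
function store_bytes!(c::Vector{Bool}, a, b, i)
    for j in 0:7
        c[i+j] = a[i+j] < b[i+j]
    end
    c
end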
We can also compare across sizes after removing GC as a factor:
using LoopVectorization, BenchmarkTools
function test_avx!(c, a, b)
@avx for i in eachindex(c)
c[i] = a[i] < b[i]
end
c
end
# More convenient test3 and test4 definitions
test3(a, b) = test_avx!(BitVector(undef, length(a)), a, b)
test4(a, b) = test_avx!(Vector{Bool}(undef, length(a)), a, b)
for _n in 2:13
n = 2 ^ _n
@show n
a = rand(n); b = rand(n);
cbit = BitVector(undef, n)
cbyte = Vector{Bool}(undef, n)
println("BitVector:")
@btime test_avx!($cbit, $a, $b)
println("Byte Vector:")
@btime test_avx!($cbyte, $a, $b)
println("")
end
I get:
n = 4
BitVector:
9.732 ns (0 allocations: 0 bytes)
Byte Vector:
11.343 ns (0 allocations: 0 bytes)
n = 8
BitVector:
9.377 ns (0 allocations: 0 bytes)
Byte Vector:
10.713 ns (0 allocations: 0 bytes)
n = 16
BitVector:
10.953 ns (0 allocations: 0 bytes)
Byte Vector:
12.574 ns (0 allocations: 0 bytes)
n = 32
BitVector:
12.291 ns (0 allocations: 0 bytes)
Byte Vector:
13.964 ns (0 allocations: 0 bytes)
n = 64
BitVector:
14.946 ns (0 allocations: 0 bytes)
Byte Vector:
15.982 ns (0 allocations: 0 bytes)
n = 128
BitVector:
18.746 ns (0 allocations: 0 bytes)
Byte Vector:
26.987 ns (0 allocations: 0 bytes)
n = 256
BitVector:
21.882 ns (0 allocations: 0 bytes)
Byte Vector:
32.635 ns (0 allocations: 0 bytes)
n = 512
BitVector:
40.731 ns (0 allocations: 0 bytes)
Byte Vector:
58.382 ns (0 allocations: 0 bytes)
n = 1024
BitVector:
73.779 ns (0 allocations: 0 bytes)
Byte Vector:
106.884 ns (0 allocations: 0 bytes)
n = 2048
BitVector:
143.350 ns (0 allocations: 0 bytes)
Byte Vector:
223.396 ns (0 allocations: 0 bytes)
n = 4096
BitVector:
385.217 ns (0 allocations: 0 bytes)
Byte Vector:
452.274 ns (0 allocations: 0 bytes)
n = 8192
BitVector:
784.822 ns (0 allocations: 0 bytes)
Byte Vector:
920.618 ns (0 allocations: 0 bytes)
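For a quick sense of scale, dividing the quoted timings at a couple of the larger sizes:

106.884 / 73.779    # n = 1024: the Vector{Bool} version takes ≈ 1.45x as long as the BitVector one
920.618 / 784.822   # n = 8192: ≈ 1.17x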
Looking a little more in depth, in part because LoopVectorization is choosing not to unroll this loop and I’m curious about whether or not that helps:
function test_avx_unroll4!(c, a, b)
@avx unroll=4 for i in eachindex(c)
c[i] = a[i] < b[i]
end
c
end
N = 1024;
cbit = BitVector(undef, N);
cbyte = Vector{Bool}(undef, N);
a = rand(N); b = rand(N);
using LinuxPerf
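# Helper: call f(args...) N times so the perf counters below aggregate over many runs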
foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))
Now, running the following twice:
@pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" foreachf!(test_avx!, 10_000_000, cbit, a, b)
@pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" foreachf!(test_avx!, 10_000_000, cbyte, a, b)
@pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" foreachf!(test_avx_unroll4!, 10_000_000, cbit, a, b)
@pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" foreachf!(test_avx_unroll4!, 10_000_000, cbyte, a, b)
I get the following on the second run; BitVector without extra unrolling:
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" foreachf!(test_avx!, 10_000_000, cbit, a, b)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 4.34e+09 60.1% # 4.3 cycles per ns
┌ instructions 1.44e+10 60.1% # 3.3 insns per cycle
│ branch-instructions 2.19e+09 60.1% # 15.2% of instructions
└ branch-misses 1.00e+07 60.1% # 0.5% of branch instructions
┌ task-clock 1.01e+09 100.0% # 1.0 s
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 2.29e+05 19.9% # 0.0% of dcache loads
│ L1-dcache-loads 4.27e+09 19.9%
└ L1-icache-load-misses 3.44e+04 19.9%
┌ dTLB-load-misses 1.12e+04 19.9% # 0.0% of dTLB loads
└ dTLB-loads 4.27e+09 19.9%
┌ iTLB-load-misses 3.30e+04 40.0% # 2676.4% of iTLB loads
└ iTLB-loads 1.23e+03 40.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Here, we see that it ran at a clock speed of 4.3 GHz (4.3 cycles per ns == 4.3 billion cycles / s), and executed 3.3 instructions per clock cycle (IPC).
15.2% of those instructions were branches, and 0.5% of branches were mispredicted (real world branch prediction rate will probably be lower than when we’re just running the same code over and over again).
Our main loop was 7 instructions, of which 1 was a branch: do we repeat the loop, or exit?
julia> 1/7
0.14285714285714285
So we have a higher branch density overall than our main loop implies.
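As a rough sanity check (taking the scaled counts above at face value, and assuming essentially every iteration is the 8-element vectorized body), the main loop only accounts for part of those branches:

1024 ÷ 8              # 128 loop iterations per call
10_000_000 * 128      # ≈ 1.28e9 loop-back branches across all calls
2.19e9 - 1.28e9       # ≈ 9.1e8 branches from outside the main loop (call and benchmark-harness overhead)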
At this size, we had 0% L1 cache misses, so memory bottlenecking isn’t a problem. Now, looking at using Vector{Bool} instead:
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" foreachf!(test_avx!, 10_000_000, cbyte, a, b)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 5.62e+09 60.0% # 4.3 cycles per ns
┌ instructions 1.55e+10 60.0% # 2.8 insns per cycle
│ branch-instructions 2.16e+09 60.0% # 13.9% of instructions
└ branch-misses 1.01e+07 60.0% # 0.5% of branch instructions
┌ task-clock 1.31e+09 100.0% # 1.3 s
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 3.32e+05 20.0% # 0.0% of dcache loads
│ L1-dcache-loads 4.18e+09 20.0%
└ L1-icache-load-misses 4.46e+04 20.0%
┌ dTLB-load-misses 3.30e+03 19.9% # 0.0% of dTLB loads
└ dTLB-loads 4.18e+09 19.9%
┌ iTLB-load-misses 7.84e+04 39.9% # 6023.1% of iTLB loads
└ iTLB-loads 1.30e+03 39.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Note that our main loop in this version had 8 instructions, one of which was a branch (the comparison → store required 2 extra instructions, but we had one less loop counter here; eliminating the extra loop counter in the BitVector version is something for me to look into. I’ll file a LoopVectorization issue).
So given that the density of branches in our main loop is now 12.5%, it’s natural for us to have a slightly lower percentage of branch instructions.
We still have a 0% L1 cache miss rate, yet instructions per cycle decreased from 3.3 to 2.8. That means on top of this version requiring more instructions (1.55e10 vs 1.44e10), the relative difference in the number of cycles needed was even greater.
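Putting numbers on that (using the counts from the two @pstats runs above):

1.55e10 / 1.44e10   # ≈ 1.08: ~8% more instructions for the Vector{Bool} version
5.62e9  / 4.34e9    # ≈ 1.29: but ~29% more cycles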
Now let’s look at the impact of unrolling the BitVector version:
julia> @pstats "cpu-cycles,(instructions,,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" foreachf!(test_avx_unroll4!, 10_000_000, cbit, a, b)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 3.72e+09 60.0% # 4.3 cycles per ns
┌ instructions 1.06e+10 60.0% # 2.8 insns per cycle
│ branch-instructions 1.23e+09 60.0% # 11.6% of instructions
└ branch-misses 1.01e+07 60.0% # 0.8% of branch instructions
┌ task-clock 8.68e+08 100.0% # 868.5 ms
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 2.43e+05 20.0% # 0.0% of dcache loads
│ L1-dcache-loads 4.28e+09 20.0%
└ L1-icache-load-misses 4.11e+04 20.0%
┌ dTLB-load-misses 5.14e+02 20.0% # 0.0% of dTLB loads
└ dTLB-loads 4.28e+09 20.0%
┌ iTLB-load-misses 8.32e+04 40.0% # 7892.9% of iTLB loads
└ iTLB-loads 1.05e+03 40.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Instructions per cycle was 2.8, lower than in the non-unrolled version, but the number of instructions required decreased by a much larger factor, resulting in fewer cycles and less time needed overall.
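A rough accounting, using the (rounded) figures from the two BitVector runs above:

1.44e10 / 1.06e10   # ≈ 1.36: ~36% more instructions without unrolling
3.3 / 2.8           # ≈ 1.18: IPC only dropped by ~18% when unrolling
4.34e9 / 3.72e9     # ≈ 1.17: net effect, ~17% fewer cycles with unroll=4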
The Vector{Bool} version:
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" foreachf!(test_avx_unroll4!, 10_000_000, cbyte, a, b)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 5.58e+09 60.0% # 4.3 cycles per ns
┌ instructions 1.27e+10 60.0% # 2.3 insns per cycle
│ branch-instructions 1.20e+09 60.0% # 9.5% of instructions
└ branch-misses 1.01e+07 60.0% # 0.8% of branch instructions
┌ task-clock 1.30e+09 100.0% # 1.3 s
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 3.33e+05 20.0% # 0.0% of dcache loads
│ L1-dcache-loads 4.19e+09 20.0%
└ L1-icache-load-misses 3.81e+04 20.0%
┌ dTLB-load-misses 1.27e+03 20.0% # 0.0% of dTLB loads
└ dTLB-loads 4.19e+09 20.0%
┌ iTLB-load-misses 1.46e+05 40.0% # 11649.5% of iTLB loads
└ iTLB-loads 1.25e+03 40.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Even though we again have 0.0% of dcache loads resulting in misses, we again get about 0.5 fewer instructions per clock than the BitVector version, and this is about as fast as the Vector{Bool} version without unrolling, i.e. slower than either BitVector version.
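Collecting the cycle counts from the four @pstats runs above makes that comparison explicit:

# cycles: BitVector 4.34e9, Vector{Bool} 5.62e9, BitVector unroll=4 3.72e9, Vector{Bool} unroll=4 5.58e9
5.58e9 / 5.62e9   # ≈ 0.99: unroll=4 barely helps the Vector{Bool} version
3.72e9 / 4.34e9   # ≈ 0.86: it buys the BitVector version another ~14%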
I’m not sure what explains the difference in instructions per clock between the BitVector and Vector{Bool} versions.