There aren’t any branches other than the check for whether to run another iteration of the loop:
L144:
vmovupd zmm2, zmmword ptr [rcx + 8*rax]
vmovupd zmm3, zmmword ptr [rcx + 8*rax + 64]
vmovupd zmm4, zmmword ptr [rcx + 8*rax + 128]
vmovupd zmm5, zmmword ptr [rcx + 8*rax + 192]
vmovupd zmm6, zmmword ptr [rdx + 8*rax]
vmovupd zmm7, zmmword ptr [rdx + 8*rax + 64]
vmovupd zmm8, zmmword ptr [rdx + 8*rax + 128]
vmovupd zmm9, zmmword ptr [rdx + 8*rax + 192]
vsubpd zmm10, zmm2, zmm6
vsubpd zmm11, zmm3, zmm7
vsubpd zmm12, zmm4, zmm8
vsubpd zmm13, zmm5, zmm9
vpcmpgtq k1, zmm10, zmm1
vpcmpgtq k2, zmm11, zmm1
vpcmpgtq k3, zmm12, zmm1
vpcmpgtq k4, zmm13, zmm1
vcmpordpd k1 {k1}, zmm2, zmm0
vcmpordpd k2 {k2}, zmm3, zmm0
vcmpordpd k3 {k3}, zmm4, zmm0
vcmpordpd k4 {k4}, zmm5, zmm0
vmovapd zmm2 {k1}, zmm6
vmovapd zmm3 {k2}, zmm7
vmovapd zmm4 {k3}, zmm8
vmovapd zmm5 {k4}, zmm9
vmovupd zmmword ptr [rsi + 8*rax], zmm2
vmovupd zmmword ptr [rsi + 8*rax + 64], zmm3
vmovupd zmmword ptr [rsi + 8*rax + 128], zmm4
vmovupd zmmword ptr [rsi + 8*rax + 192], zmm5
add rax, 32
cmp rdi, rax
jne L144
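For reference, here is a minimal sketch of the kind of source that can compile to a loop like this. It is not necessarily the `min_` or `bench!` benchmarked here: `branchless_min` and `apply_elementwise!` are hypothetical names made up for illustration. The idea is to compute both candidates unconditionally and pick between them with `ifelse`, so the only conditional jump left is the loop test.

```julia
# Hypothetical names for illustration; not the definitions used in the benchmarks.

# Branch-free minimum: `ifelse` evaluates both arguments and selects one,
# so no conditional jump is needed inside the loop body.
@inline function branchless_min(x::Float64, y::Float64)
    d = x - y
    pick = ifelse(signbit(d), x, y)   # x when x - y is negative (also orders -0.0 before 0.0)
    pick = ifelse(isnan(x), x, pick)  # propagate a NaN in x
    pick = ifelse(isnan(y), y, pick)  # propagate a NaN in y
    return pick
end

# Elementwise driver; `eachindex(c, a, b)` throws if the arrays do not share indices.
function apply_elementwise!(f, c, a, b)
    @inbounds @simd for i in eachindex(c, a, b)
        c[i] = f(a[i], b[i])
    end
    return c
end

a = rand(1024); b = rand(1024); c = similar(a)
apply_elementwise!(branchless_min, c, a, b)
```

Whether the compiler actually vectorizes this depends on the Julia version and the target CPU, so it is worth checking with `@code_native` rather than assuming.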
Benchmarks on an AVX512 machine:
julia> @benchmark bench!(c, a, b, min) setup=(rand!(a); rand!(b))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.140 μs (0.00% GC)
median time: 2.183 μs (0.00% GC)
mean time: 2.186 μs (0.00% GC)
maximum time: 7.222 μs (0.00% GC)
--------------
samples: 100000
evals/sample: 9
julia> @benchmark bench!(c, a, b, min_) setup=(rand!(a); rand!(b))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.783 μs (0.00% GC)
median time: 1.799 μs (0.00% GC)
mean time: 1.802 μs (0.00% GC)
maximum time: 2.930 μs (0.00% GC)
--------------
samples: 100000
evals/sample: 10
With smaller arrays, where we aren’t starved for memory, we see a much larger difference: roughly 2x by minimum time, compared with about 1.2x above. Using a length of 1024 instead:
julia> @benchmark bench!($c, $a, $b, min) setup=(rand!($a); rand!($b))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 175.345 ns (0.00% GC)
median time: 176.199 ns (0.00% GC)
mean time: 176.495 ns (0.00% GC)
maximum time: 216.782 ns (0.00% GC)
--------------
samples: 38994
evals/sample: 719
julia> @benchmark bench!($c, $a, $b, min_) setup=(rand!($a); rand!($b))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 86.452 ns (0.00% GC)
median time: 89.048 ns (0.00% GC)
mean time: 89.098 ns (0.00% GC)
maximum time: 130.425 ns (0.00% GC)
--------------
samples: 57633
evals/sample: 958