The only branch in the loop body is the backward jump that decides whether to run another iteration:
L144:
        vmovupd zmm2, zmmword ptr [rcx + 8*rax]
        vmovupd zmm3, zmmword ptr [rcx + 8*rax + 64]
        vmovupd zmm4, zmmword ptr [rcx + 8*rax + 128]
        vmovupd zmm5, zmmword ptr [rcx + 8*rax + 192]
        vmovupd zmm6, zmmword ptr [rdx + 8*rax]
        vmovupd zmm7, zmmword ptr [rdx + 8*rax + 64]
        vmovupd zmm8, zmmword ptr [rdx + 8*rax + 128]
        vmovupd zmm9, zmmword ptr [rdx + 8*rax + 192]
        vsubpd  zmm10, zmm2, zmm6
        vsubpd  zmm11, zmm3, zmm7
        vsubpd  zmm12, zmm4, zmm8
        vsubpd  zmm13, zmm5, zmm9
        vpcmpgtq        k1, zmm10, zmm1
        vpcmpgtq        k2, zmm11, zmm1
        vpcmpgtq        k3, zmm12, zmm1
        vpcmpgtq        k4, zmm13, zmm1
        vcmpordpd       k1 {k1}, zmm2, zmm0
        vcmpordpd       k2 {k2}, zmm3, zmm0
        vcmpordpd       k3 {k3}, zmm4, zmm0
        vcmpordpd       k4 {k4}, zmm5, zmm0
        vmovapd zmm2 {k1}, zmm6
        vmovapd zmm3 {k2}, zmm7
        vmovapd zmm4 {k3}, zmm8
        vmovapd zmm5 {k4}, zmm9
        vmovupd zmmword ptr [rsi + 8*rax], zmm2
        vmovupd zmmword ptr [rsi + 8*rax + 64], zmm3
        vmovupd zmmword ptr [rsi + 8*rax + 128], zmm4
        vmovupd zmmword ptr [rsi + 8*rax + 192], zmm5
        add     rax, 32
        cmp     rdi, rax
        jne     L144
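Read lane by lane, the loop above seems to reduce to a subtract, a sign-bit test, a NaN test, and a masked select. The following is a minimal scalar sketch of that reading, not the source of either function being benchmarked. It assumes the constants loaded before the loop (not visible in the excerpt) are `zmm0 = 0.0` and `zmm1 = -1` (all bits set), and that `rcx` and `rdx` point at the two inputs, here called `a` and `b`. The name `lane_min` is hypothetical.

```julia
# Hypothetical scalar model of one lane of the vectorized loop.
# Assumption: zmm1 holds -1 (all bits set) and zmm0 holds 0.0;
# neither register is initialized inside the excerpt shown.
function lane_min(a::Float64, b::Float64)
    d = a - b                                  # vsubpd
    nonneg = reinterpret(Int64, d) > -1        # vpcmpgtq: sign bit of the difference is clear
    take_b = nonneg & !isnan(a)                # masked vcmpordpd: a is ordered, i.e. not NaN
    return ifelse(take_b, b, a)                # masked vmovapd: overwrite a with b where the mask is set
end
```

Because the selection is done with `ifelse` rather than a conditional branch, every lane executes the same instructions regardless of the data, which is what lets the compiler unroll the loop four times over 512-bit registers.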
Benchmarks on an AVX512 machine:
julia> @benchmark bench!(c, a, b, min)   setup=(rand!(a); rand!(b))
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.140 μs (0.00% GC)
  median time:      2.183 μs (0.00% GC)
  mean time:        2.186 μs (0.00% GC)
  maximum time:     7.222 μs (0.00% GC)
  --------------
  samples:          100000
  evals/sample:     9
julia> @benchmark bench!(c, a, b, min_)   setup=(rand!(a); rand!(b))
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.783 μs (0.00% GC)
  median time:      1.799 μs (0.00% GC)
  mean time:        1.802 μs (0.00% GC)
  maximum time:     2.930 μs (0.00% GC)
  --------------
  samples:          100000
  evals/sample:     10
With smaller arrays, where memory bandwidth is no longer the bottleneck, the difference is much larger: about 2x in minimum time. Using a length of 1024 instead:
julia> @benchmark bench!($c, $a, $b, min)   setup=(rand!($a); rand!($b))
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     175.345 ns (0.00% GC)
  median time:      176.199 ns (0.00% GC)
  mean time:        176.495 ns (0.00% GC)
  maximum time:     216.782 ns (0.00% GC)
  --------------
  samples:          38994
  evals/sample:     719
julia> @benchmark bench!($c, $a, $b, min_)   setup=(rand!($a); rand!($b))
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     86.452 ns (0.00% GC)
  median time:      89.048 ns (0.00% GC)
  mean time:        89.098 ns (0.00% GC)
  maximum time:     130.425 ns (0.00% GC)
  --------------
  samples:          57633
  evals/sample:     958
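To put numbers on the two comparisons, the speedups can be computed directly from the minimum times reported above. This is plain arithmetic on the printed figures. The variable names are hypothetical, and "large" simply refers to the first pair of benchmarks, whose array length is not stated in this excerpt.

```julia
# Minimum times copied from the benchmark output above.
large_min,  large_min_  = 2.140, 1.783      # μs, first pair of runs (min vs. min_)
small_min,  small_min_  = 175.345, 86.452   # ns, length-1024 runs (min vs. min_)

# Ratio of the slower time to the faster time; units cancel within each pair.
speedup_large = large_min / large_min_
speedup_small = small_min / small_min_

println("first pair:  ", round(speedup_large, digits = 2), "x")
println("length 1024: ", round(speedup_small, digits = 2), "x")
```

The first ratio comes out near 1.2 and the second near 2.0, which fits the first case being limited by memory traffic rather than by the arithmetic in the loop.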