Unrolled loop results in improbable @benchmark timing

The last timing is probably not accurate. The compiler has become very good at constant propagation lately, which makes proper benchmarks incredibly hard. Your case is solvable by letting the @benchmark macro initialise the input locally through its setup argument. This avoids the lookup time for global variables, and the inputs don’t get constant-propped away (at least not on my machine):

julia> @benchmark co(A, B, UInt64(1), UInt32(0), UInt32(0)) setup = begin A = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...); B = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...) end
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     54.813 ns (0.00% GC)
  median time:      54.914 ns (0.00% GC)
  mean time:        56.826 ns (0.00% GC)
  maximum time:     165.957 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     987

julia> @benchmark co2(A, B, UInt64(1), UInt32(0), UInt32(0)) setup = begin A = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...); B = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...) end
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     51.064 ns (0.00% GC)
  median time:      51.165 ns (0.00% GC)
  mean time:        52.313 ns (0.00% GC)
  maximum time:     140.932 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     987
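
For comparison, here is a minimal sketch of the usual ways to feed inputs to @benchmark. It assumes BenchmarkTools and StaticArrays are installed. `kernel` and `randvec` are hypothetical stand-ins I made up for illustration, not the `co`/`co2` from the question, and `seconds=1` is only there to keep the run short. Which variants get constant-folded depends on the function and the Julia version, so treat the numbers as something to check on your own machine:

```julia
using BenchmarkTools, StaticArrays

# Hypothetical stand-in for `co`; NOT the function from the question.
kernel(A, B) = sum(A .* B)

# Hypothetical helper that builds one random 16-element input.
randvec() = SVector{16, UInt32}(rand(UInt32, 16))

A = randvec()
B = randvec()

# 1. Non-const globals used directly: the global lookup and dynamic
#    dispatch are part of what gets timed.
t_global = @benchmark kernel(A, B) seconds=1

# 2. Interpolated values: no global lookup, but the values are spliced
#    into the benchmark, so the compiler may be able to fold them.
t_interp = @benchmark kernel($A, $B) seconds=1

# 3. Interpolated Refs, dereferenced inside the benchmarked expression.
#    This is the trick the BenchmarkTools manual suggests to hide the
#    values from constant propagation.
t_ref = @benchmark kernel($(Ref(A))[], $(Ref(B))[]) seconds=1

# 4. Inputs created in `setup`, as in the timings above: they are local
#    to the benchmark and regenerated for each sample.
t_setup = @benchmark kernel(x, y) seconds=1 setup=(x = randvec(); y = randvec())

for (name, t) in (("global", t_global), ("interp", t_interp),
                  ("ref", t_ref), ("setup", t_setup))
    println(rpad(name, 8), time(minimum(t)), " ns")
end
```

If variant 2 reports a sub-nanosecond time while 3 and 4 agree with each other, that is a strong hint the interpolated call was folded away rather than actually executed.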