The last timing is probably not accurate. The compiler has become very good at constant propagation lately, which makes proper benchmarking incredibly hard. In your case, you can work around this by letting the @benchmark macro initialise the inputs locally through its setup argument. This avoids the lookup cost for global variables, and the inputs don't get constant-propped away (at least not on my machine):
julia> @benchmark co(A, B, UInt64(1), UInt32(0), UInt32(0)) setup = begin A = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...); B = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...) end
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 54.813 ns (0.00% GC)
median time: 54.914 ns (0.00% GC)
mean time: 56.826 ns (0.00% GC)
maximum time: 165.957 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 987
julia> @benchmark co2(A, B, UInt64(1), UInt32(0), UInt32(0)) setup = begin A = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...); B = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...) end
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 51.064 ns (0.00% GC)
median time: 51.165 ns (0.00% GC)
mean time: 52.313 ns (0.00% GC)
maximum time: 140.932 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 987
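For anyone who wants to try the two usual ways of keeping inputs away from the constant propagator, here is a minimal sketch. Note that `mix` is a hypothetical stand-in I made up so the snippet runs on its own; it is not your `co` or `co2`, so substitute your own functions. The first form uses `setup` as above; the second interpolates `Ref`-wrapped values and dereferences them inside the benchmarked expression, which is the trick described in the BenchmarkTools manual. I am assuming both BenchmarkTools and StaticArrays are installed.

```julia
using BenchmarkTools, StaticArrays

# Hypothetical stand-in for the function under test (NOT the `co` from this thread).
# It XORs the two vectors elementwise, folds the result, and mixes in the counter.
mix(a, b, k) = reduce(⊻, a .⊻ b) ⊻ (k % UInt32)

# Option 1: let `setup` create fresh inputs for each sample.
# `a` and `b` are local to the benchmark, so there is no global lookup.
t_setup = @benchmark mix(a, b, UInt64(1)) setup = begin
    a = rand(SVector{16, UInt32})
    b = rand(SVector{16, UInt32})
end

# Option 2: interpolate Ref-wrapped values and dereference them in the call.
# The compiler only sees a load from the Ref, not a compile-time constant.
A = rand(SVector{16, UInt32})
B = rand(SVector{16, UInt32})
t_ref = @benchmark mix($(Ref(A))[], $(Ref(B))[], $(Ref(UInt64(1)))[])

display(t_setup)
display(t_ref)
```

If the two forms report roughly the same timing, that is a decent sign the measurement is real. A minimum time well under a nanosecond usually means the whole call was folded away at compile time.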