Performance of value assignment inside functions

I am puzzled by Julia's performance: somehow it matters whether I fully specialize the function arguments.
Consider the two functions:

@noinline function update_gen!(y::AbstractVector{T}, x::AbstractVector{T}) where {T<:Number}
	@inbounds y[1] = x[1]
	return nothing
end

@noinline function update_float!(y::Vector{Float64}, x::Vector{Float64})
	@inbounds y[1] = x[1]
	return nothing
end

and called with some input

t_a = [51.31]
t_b = [12.3]

update_gen!(t_a, t_b)
update_float!(t_a, t_b)

No matter how I measure performance (@time or @benchmark …), update_float! is always faster by a factor of ~2, while the output of @code_llvm and @code_native is, as expected, exactly the same for both versions.
I would be very grateful for an explanation of this behavior.
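The claim that both functions compile to the same code can be spot-checked from the REPL; a minimal sketch (hypothetical names `f_gen!`/`f_flt!`, `@inbounds` omitted for brevity, not the exact definitions above):

```julia
# Two one-line stand-ins for update_gen!/update_float! from the post:
f_gen!(y::AbstractVector{T}, x::AbstractVector{T}) where {T<:Number} = (y[1] = x[1]; nothing)
f_flt!(y::Vector{Float64}, x::Vector{Float64}) = (y[1] = x[1]; nothing)

# Julia specializes the generic method for the concrete argument types,
# so asking for the typed IR at Vector{Float64} gives the same result:
ir_gen = code_typed(f_gen!, (Vector{Float64}, Vector{Float64}))[1]
ir_flt = code_typed(f_flt!, (Vector{Float64}, Vector{Float64}))[1]

# Both specializations infer the same return type.
@assert ir_gen[2] === Nothing && ir_flt[2] === Nothing
```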

This is strange indeed, as I don’t see such a difference.

julia> @benchmark update_float!($t_a, $t_b)
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  14.715 ns … 30.268 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.066 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.113 ns ±  0.324 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▂▃▃▅█▇                                             ▂
  ▄▃▁▄▄▅▅▆▇██████▇▃▁▄▃▁▁▄▄▁▁▃▃▁▃▄▃▄▄▅▅▆▅▆▆▅▆▆▆▇▇▆▇▆▇▇▇▇▇▆▇▇▇▆ █
  14.7 ns      Histogram: log(frequency) by time      16.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark update_gen!($t_a, $t_b)
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  14.693 ns …  1.314 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.066 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.233 ns ± 13.017 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▁▂▃▂▃▃▇▇█                                     ▂
  ▄▁▁▁▁▄▄▃▁▅▄▆▅▆██████████▆▁▃▁▁▃▁▁▁▁▁▁▁▃▁▁▃▄▄▁▃▁▃▃▁▃▁▃▃▁▃▄▄▄▄ █
  14.7 ns      Histogram: log(frequency) by time      15.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Try dropping the interpolation of the input/output arrays and the Float64 version becomes significantly faster.

Dropping the interpolation means that you are timing dynamic dispatch, which is not typically relevant to performance in real applications.
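A sketch of why the interpolation matters (hypothetical helper `f!` and variables `c_a`/`c_b`, not from the thread): a non-`const` global has a type the compiler cannot assume at the call site, so every call from global scope goes through dynamic dispatch, whereas `$t_a` in `@benchmark` (or a `const` binding) lets the call be dispatched statically.

```julia
f!(y::AbstractVector{T}, x::AbstractVector{T}) where {T<:Number} = (y[1] = x[1]; nothing)

t_a = [51.31]        # non-const global: type opaque to the compiler
t_b = [12.3]
f!(t_a, t_b)         # called like this, the dispatch happens at runtime

const c_a = [51.31]  # const global: concrete type known to the compiler
const c_b = [12.3]
f!(c_a, c_b)         # statically dispatched, like the interpolated benchmark
```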