Potential small inefficiency in `triu!` and `tril!`?

The definition of triu! in dense.jl of LinearAlgebra is copy-pasted below. Observing line 181, we see that zero(M[i]) is calculated at each iteration of the inner loop. Obviously, this can be moved out of the loop and re-written as ZERO = zero(M[1]) just below idx = 1. Line 181 now should read M[i] = ZERO. The same thing happens again at line 225 in tril!.

function tril!(M::AbstractMatrix, k::Integer)
    m, n = size(M)
    if !(-m - 1 <= k <= n - 1)
        throw(ArgumentError(string("the requested diagonal, $k, must be at least ",
            "$(-m - 1) and at most $(n - 1) in an $m-by-$n matrix")))
    end
    idx = 1
    for j = 0:n-1
        ii = min(max(0, j-k), m)
        for i = idx:(idx+ii-1)
            M[i] = zero(M[i])
        end
        idx += m
    end
    M
end

Benchmarking the original and modified versions shows a small performance improvement of about 8% .

using BenchmarkTools
a = rand(1000,1000);
julia> @btime triu!(a,0);
  187.311 μs (0 allocations: 0 bytes)
julia> @btime my_triu!(a,0);
  173.455 μs (0 allocations: 0 bytes)

Should I open an issue to modify both functions or did I miss something?

EDIT:

Oh, sorry, my fault. I modified a between the two calls. The compiler of 0.7-alpha seems too smart and likely, it optimized that away and constant-propagated that ZERO.

julia> a = rand(1000,1000);

julia> b = copy(a);

julia> @btime triu!(a,0);
  173.969 μs (0 allocations: 0 bytes)

julia> @btime my_triu!(b,0);
  173.456 μs (0 allocations: 0 bytes)