I have two functions whose only difference is that one uses += and the other uses =, yet their runtimes differ by more than 4x. Addition is far cheaper than sin and cos, so I would not expect the extra + to make such a big difference. I wrote both versions in Fortran and confirmed that there the two run at about the same speed. So why does the Julia compiler generate slow code for the += version? The first definition below uses = (about 5.9 ms); the second uses += (about 25 ms).
julia> using BenchmarkTools
julia> function pyramid0!(v!, x::AbstractVector{T}) where T
           @assert size(v!,2) == size(v!,1) == length(x)
           for j=1:length(x)
               v![1,j] = x[j]
           end
           @inbounds for i=1:size(v!,1)-1
               for j=1:size(v!,2)-i
                   v![i+1,j] = cos(v![i,j+1]) * sin(v![i,j])
               end
           end
       end
pyramid0! (generic function with 1 method)
julia> let
           n = 1000
           x = collect(Float64, 1:n)
           v = zeros(n, n)
           @benchmark pyramid0!($v, $x) seconds=1
       end
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.902 ms (0.00% GC)
  median time:      5.913 ms (0.00% GC)
  mean time:        5.943 ms (0.00% GC)
  maximum time:     6.341 ms (0.00% GC)
  --------------
  samples:          169
  evals/sample:     1
julia> function pyramid0!(v!, x::AbstractVector{T}) where T
           @assert size(v!,2) == size(v!,1) == length(x)
           for j=1:length(x)
               v![1,j] = x[j]
           end
           @inbounds for i=1:size(v!,1)-1
               for j=1:size(v!,2)-i
                   v![i+1,j] += cos(v![i,j+1]) * sin(v![i,j])
               end
           end
       end
pyramid0! (generic function with 1 method)
julia> let
           n = 1000
           x = collect(Float64, 1:n)
           v = zeros(n, n)
           @benchmark pyramid0!($v, $x) seconds=1
       end
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.104 ms (0.00% GC)
  median time:      25.257 ms (0.00% GC)
  mean time:        25.393 ms (0.00% GC)
  maximum time:     28.555 ms (0.00% GC)
  --------------
  samples:          40
  evals/sample:     1
I see the same behavior on Julia 1.5, 1.6, and the master branch.
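One thing that may be worth ruling out is the benchmark state itself. The = version overwrites v on every call, but the += version keeps adding to the same matrix, so each benchmark sample starts from the previous sample's output and sin and cos see different arguments every time. I have not confirmed that this explains the gap. The following is only a sketch of how to check it, assuming the setup and evals keywords of BenchmarkTools; the names pyramid_add! and w are made up here for illustration.

```julia
using BenchmarkTools

# Hypothetical name; the body matches the += version above.
function pyramid_add!(v, x)
    @assert size(v, 2) == size(v, 1) == length(x)
    for j in 1:length(x)
        v[1, j] = x[j]
    end
    @inbounds for i in 1:size(v, 1)-1
        for j in 1:size(v, 2)-i
            v[i+1, j] += cos(v[i, j+1]) * sin(v[i, j])
        end
    end
    return v
end

n = 1000
x = collect(Float64, 1:n)

# Call twice on the same matrix: the second call starts from the
# values left behind by the first one, so the results are not equal.
v = zeros(n, n)
first_pass = copy(pyramid_add!(v, x))
second_pass = copy(pyramid_add!(v, x))
println("same result after a second call? ", first_pass == second_pass)

# Benchmark with a fresh zero matrix for every sample. evals=1 makes
# each sample a single call, so every timed call sees the setup matrix.
trial = @benchmark pyramid_add!(w, $x) setup=(w = zeros($n, $n)) evals=1 seconds=1
display(trial)
```

If the timing with a fresh matrix per sample is close to the = version, the difference comes from the values being fed to sin and cos rather than from the + itself.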