Hi everyone,

I expected this to be a trivial optimization, but I am struggling to write vectorized code that doesn’t allocate memory. Consider the following code:

```
function gram_schmidt1!{T <: Real}(U::Matrix{T})
    m = size(U, 2)
    @inbounds for k = 1:m
        uk = view(U, :, k)
        for j = 1:k-1
            uk .-= (U[:,j] ⋅ uk) .* U[:,j]
        end
        uk ./= norm(uk)
    end
end
```

Then I get

```
using BenchmarkTools
N = 100
A = randn(N, N)
@btime gram_schmidt1!($A)
6.859 ms (19900 allocations: 8.62 MiB)
```

which results in a surprisingly high number of allocations.
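As far as I can tell, the culprit is the `U[:,j]` slices inside the inner loop: indexing a column with `:` materializes a fresh `Vector` on every inner iteration, which is where the bulk of those 19900 allocations seems to come from. Here is a sketch of a variant that wraps both columns in `view`s instead (the function name is mine; the `using LinearAlgebra` line is only needed on Julia ≥ 0.7, where `dot` and `norm` moved out of Base):

```julia
using LinearAlgebra  # for ⋅ and norm on Julia ≥ 0.7; they are in Base on 0.6

# Sketch (name is mine): view-based modified Gram-Schmidt. Each `view` is a
# lightweight wrapper over the column, so the O(n) copies from `U[:,j]` go
# away; on older Julia versions the view wrapper itself may still allocate
# a few bytes per call.
function gram_schmidt_views!(U::Matrix{T}) where {T <: Real}
    m = size(U, 2)
    @inbounds for k = 1:m
        uk = view(U, :, k)
        for j = 1:k-1
            uj = view(U, :, j)          # no column copy, unlike U[:,j]
            uk .-= (uj ⋅ uk) .* uj      # scalar dot + one fused in-place broadcast
        end
        uk ./= norm(uk)
    end
    return U
end
```

If I understand broadcast fusion correctly, this keeps the vectorized style while cutting the per-iteration copies.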

If I instead use a de-vectorized version,

```
function gram_schmidt2!{T <: Real}(U::Matrix{T})
    n, m = size(U)
    @assert n ≥ 1
    @inbounds for k = 1:m
        for j = 1:k-1
            dotjk = U[1,j] * U[1,k]
            @simd for i = 2:n
                dotjk += U[i,j] * U[i,k]
            end
            @simd for i = 1:n
                U[i,k] -= dotjk * U[i,j]
            end
        end
        uknorm = abs2(U[1,k])
        @simd for i = 2:n
            uknorm += abs2(U[i,k])
        end
        uknorm = √(uknorm)
        @simd for i = 1:n
            U[i,k] /= uknorm
        end
    end
end
```

I get

```
@btime gram_schmidt2!($A)
287.653 μs (0 allocations: 0 bytes)
```
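For completeness, a middle ground I have been considering (a sketch, function name mine): keep the column-level formulation but push the updates through in-place kernels, which should also avoid temporaries. `rmul!` is the ≥ 0.7 name; on 0.6 the equivalent is `scale!`.

```julia
using LinearAlgebra  # axpy!, rmul!, ⋅, norm live here on Julia ≥ 0.7

# Sketch (name is mine): column-level Gram-Schmidt via in-place kernels.
# axpy!(a, x, y) computes y .= a .* x .+ y without temporaries.
function gram_schmidt_axpy!(U::Matrix{T}) where {T <: Real}
    m = size(U, 2)
    @inbounds for k = 1:m
        uk = view(U, :, k)
        for j = 1:k-1
            uj = view(U, :, j)
            axpy!(-(uj ⋅ uk), uj, uk)   # uk ← uk - (uj⋅uk)·uj, in place
        end
        rmul!(uk, inv(norm(uk)))        # normalize in place (`scale!` on 0.6)
    end
    return U
end
```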

both on Julia v0.6.3.

I played around with benchmarking and memory allocation analysis, and even the dot product on its own seems to allocate memory. Am I doing something fundamentally wrong here?
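To illustrate what I mean about the dot product: as far as I can tell it is the slicing, not `dot` itself, that allocates, since `U[:,j]` copies the column before `dot` ever sees it. A small demo (names are mine):

```julia
using LinearAlgebra  # for dot on Julia ≥ 0.7

U = randn(100, 100)
dot_slice(U) = dot(U[:, 1], U[:, 2])              # each slice copies its column
dot_view(U)  = dot(view(U, :, 1), view(U, :, 2))  # views reference columns in place

dot_slice(U); dot_view(U)        # warm up, to exclude compilation allocations
a1 = @allocated dot_slice(U)     # pays for two column copies
a2 = @allocated dot_view(U)      # should be (near) zero on recent Julia
println((a1, a2))
```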