I could use a fast way to compute an orthonormal basis for the columns of a tall, narrow dense matrix.

Taking advantage of an idle hour, I came up with this modified Gram-Schmidt:

```julia
using LinearAlgebra

# Dot product of columns i and j of A.
function coldot(A, j, i)
    m = size(A, 1)
    r = zero(eltype(A))
    @simd for k in 1:m
        r += A[k, i] * A[k, j]
    end
    return r
end

# Euclidean norm of column j of A.
function colnorm(A, j)
    return sqrt(coldot(A, j, j))
end

# Subtract r times column j from column i, in place.
function colsubt!(A, i, j, r)
    m = size(A, 1)
    @simd for k in 1:m
        A[k, i] -= r * A[k, j]
    end
end

# Scale column j to unit norm, in place.
function normalizecol!(A, j)
    m = size(A, 1)
    r = one(eltype(A)) / colnorm(A, j)
    @simd for k in 1:m
        A[k, j] *= r
    end
end

# Modified Gram-Schmidt: orthonormalize the columns of A in place.
function mgsortho!(A)
    m, n = size(A)
    normalizecol!(A, 1)
    for j in 2:n
        # Remove the component along column j-1 from all remaining columns.
        Base.Threads.@threads for i in j:n
            colsubt!(A, i, j-1, coldot(A, j-1, i))
        end
        normalizecol!(A, j)
    end
    return A
end
```
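A quick sanity check, assuming the definitions above are in scope: after `mgsortho!`, the Gram matrix of the columns should be (numerically) the identity.

```julia
using LinearAlgebra

A = randn(1_000, 20)        # tall, narrow test matrix
mgsortho!(A)                # orthonormalize the columns in place
println(norm(A' * A - I))   # residual from the identity; should be near machine epsilon
```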

Initially I found it (predictably) slower than the built-in `qr!`. For instance, for a 200,000 × 200 matrix, my version was more than three times slower.

However, if I have a multicore machine, I can use more threads, and things change quite a bit.

For a 2,000,000 × 400 matrix, using multiple threads:

| # threads | `qr!` | `mgsortho!` |
|----------:|------:|------------:|
| 1         | 232 s | 305 s       |
| 2         | 216 s | 170 s       |
| 4         | 212 s | 141 s       |
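For anyone who wants to reproduce a comparison like this, a sketch (with the matrix shrunk so it runs quickly; the timings above are from the full-size run). Note the two routines are threaded differently: `qr!` is parallelized through BLAS, while `mgsortho!` uses Julia threads, so the process must be started with `julia -t N`.

```julia
using LinearAlgebra

m, n = 100_000, 50          # reduced size for a quick run
A = randn(m, n)
B = copy(A)

# Let BLAS use as many threads as Julia does, for a fair comparison.
BLAS.set_num_threads(Threads.nthreads())

t_qr  = @elapsed qr!(A)       # LAPACK QR, threaded via BLAS
t_mgs = @elapsed mgsortho!(B) # our MGS, threaded via Base.Threads
println("qr!: $t_qr s   mgsortho!: $t_mgs s")
```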

To conclude, not a question but a challenge: care to make it faster still?