Because it does pivoted QR, which is more expensive (but more stable) than the normal-equations approach (what you are calling the "OLS formula"). In particular, when minimizing $\Vert Ax - b\Vert$, the QR method needs to allocate storage comparable in size to the original matrix $A$, whereas the normal-equations approach needs allocations only proportional to the size of the much smaller matrix $A^T A$. See also:
Note, by the way, that `inv(x'*x) * x' * y` is suboptimal in multiple respects, besides the potential inaccuracy compared to QR if `x` is badly conditioned. First, you don't really need to invert the matrix: you can use `(x'x) \ ...` instead, or even `cholesky(Hermitian(x'x)) \ ...` to explicitly tell Julia to use a Cholesky factorization, since `x'x` is SPD. Second, beware that multiplication in Julia is left-associative, so `inv(x'*x) * x' * y` is really computing `(inv(x'*x) * x') * y`, which requires an inefficient matrix–matrix product, rather than `inv(x'*x) * (x' * y)`, which involves only a matrix–vector product. Parentheses matter (for performance) in matrix expressions!
However, in large problems the cost of the normal-equations approach is usually dominated by the cost of computing $A^T A$.
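(Concretely, for an $m \times n$ matrix $A$ with $m \gg n$, forming $A^T A$ costs roughly $mn^2$ floating-point operations, after which the Cholesky factorization costs only about $n^3/3$, whereas Householder QR costs roughly $2mn^2 - \frac{2}{3}n^3$, about a factor of two more, which is consistent with the timings below.)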
As you noticed, the first step is to pre-allocate matrices for intermediate computations. Use `mul!` for pre-allocated multiplications and `ldiv!` for pre-allocated backslash. Don't use `inv`, as discussed above; use a factorization object with pre-allocated storage instead. For example, in the normal-equations approach, we can use `cholesky!` to do everything with zero allocations (after pre-allocation):
```julia
m, n = 1000, 10
A = rand(m, n); b = rand(m); # input data

# preallocate outputs and temporaries:
AᵀA = similar(A, n, n);
Aᵀb = similar(b, n);
x = similar(b, n);
```
which gives:
```julia
julia> @btime ldiv!($x, cholesky!(Hermitian(mul!($AᵀA, $A', $A))),
                    mul!($Aᵀb, $A', $b));
  14.026 μs (0 allocations: 0 bytes)

julia> x ≈ A \ b
true
```
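If you need to do this repeatedly (e.g. in a loop with changing data), it may be convenient to bundle the steps into a small helper function. A minimal sketch; the name `solve_normal!` and the argument order are just illustrative, not a library API:

```julia
# Solve min ‖Ax - b‖ via the normal equations, reusing the pre-allocated
# buffers AᵀA, Aᵀb, and x, so the whole call makes zero allocations:
function solve_normal!(x, AᵀA, Aᵀb, A, b)
    mul!(AᵀA, A', A)               # AᵀA ← A'A (in-place)
    mul!(Aᵀb, A', b)               # Aᵀb ← A'b (in-place)
    F = cholesky!(Hermitian(AᵀA))  # in-place Cholesky of the SPD matrix
    return ldiv!(x, F, Aᵀb)        # x ← F \ Aᵀb (in-place solve)
end

solve_normal!(x, AᵀA, Aᵀb, A, b) ≈ A \ b  # true (for well-conditioned A)
```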
As mentioned in the link above, you should be cautious about using the normal equations if `A` might be badly conditioned; that's why it's not the default for `A \ b`.
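To see why, note that the condition number gets squared: $\kappa(A^T A) = \kappa(A)^2$, so the normal equations can lose roughly twice as many significant digits as QR. A small illustration; the 3×2 matrix below is just a standard nearly rank-deficient example:

```julia
using LinearAlgebra

ε = 1e-7
A = [1.0 1.0; ε 0.0; 0.0 ε]   # κ(A) ≈ 1.4e7
b = A * [1.0, 1.0]            # exact least-squares solution is [1, 1]

x_qr = A \ b                             # pivoted QR: error ~ κ(A)⋅eps
x_ne = cholesky(Hermitian(A'A)) \ (A'b)  # normal equations: error ~ κ(A)²⋅eps

norm(x_qr - [1, 1]), norm(x_ne - [1, 1]) # the second error is typically far larger
# (with still smaller ε, the computed A'A rounds to an exactly singular
#  matrix and cholesky throws a PosDefException)
```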
Doing the QR approach in-place is harder. There is a `qr!` function that can exploit pre-allocated output, but it's not completely allocation-free:
```julia
julia> Atmp = similar(A);

julia> @btime qr!(copyto!($Atmp, $A), ColumnNorm()); # pivoted QR
  30.034 μs (4 allocations: 3.34 KiB)
```
The reason for this is that LAPACK’s QR algorithm requires additional working storage beyond the storage for the output factorization, and Julia’s high-level interface currently provides no way to pre-allocate that additional storage. The only way to do it completely pre-allocated would be to call the low-level LAPACK interface.
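For reference, the LAPACK routine underlying pivoted QR is `geqp3`, and Julia's wrapper does let you pre-allocate the pivot and τ vectors; a rough sketch is below. Note that (at least in current Julia versions, as far as I can tell) this wrapper still performs its own internal workspace query and allocation, so a truly allocation-free version would need a raw `ccall`:

```julia
using LinearAlgebra

m, n = 1000, 10
A = rand(m, n); Atmp = similar(A);
jpvt = zeros(LinearAlgebra.BlasInt, n)  # pivot indices; 0 marks a "free" column
tau  = similar(A, min(m, n))            # Householder reflector coefficients

copyto!(Atmp, A)
fill!(jpvt, 0)  # reset the pivots before every call
LinearAlgebra.LAPACK.geqp3!(Atmp, jpvt, tau)  # Atmp is overwritten with the packed Q and R
```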
However, unless you are solving rather tiny least-squares problems, the allocations are usually not such a dominant factor in performance anyway. For example, if you skip pre-allocating the `qr` output in the example above, the time increases by less than 10%:
```julia
julia> @btime qr($A, ColumnNorm()); # pivoted QR, without pre-allocation
  32.279 μs (6 allocations: 81.52 KiB)
```
and the time for the normal-equations approach is virtually unchanged without pre-allocation in this example:
```julia
julia> @btime cholesky!(Hermitian($A'*$A)) \ ($A'*$b);
  14.104 μs (3 allocations: 1.16 KiB)
```
A lot more readable!