Is there a way to get the result without any allocations? I am aware of this discussion, but I could not find anything there about avoiding allocations.
Because it does pivoted QR, which is more expensive (but more stable) than the normal-equations approach (what you are calling the “OLS formula”). In particular, when minimizing \Vert Ax - b\Vert, the QR method needs to allocate storage comparable in size to the original matrix A, whereas the normal-equations approach needs allocations only proportional to the size of the much smaller matrix A^T A. See also:
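For reference, a minimal sketch of that equivalence (assuming Julia ≥ 1.7, where ColumnNorm() requests column pivoting):

using LinearAlgebra
A = rand(1000, 10); b = rand(1000);
A \ b ≈ qr(A, ColumnNorm()) \ b   # dense rectangular \ performs pivoted QR internally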
Note, by the way, that inv(x'*x) * x' * y is suboptimal in multiple respects, besides the potential inaccuracy compared to QR if x is badly conditioned. First, you don’t really need to invert the matrix: you can use (x'x) \ ... instead, or even cholesky(Hermitian(x'x)) \ ... to tell Julia explicitly to use a Cholesky factorization, since x'x is symmetric positive-definite (SPD). Second, beware that multiplication in Julia is left-associative, so inv(x'*x) * x' * y really computes (inv(x'*x) * x') * y, which requires an inefficient matrix–matrix product, rather than inv(x'*x) * (x' * y), which involves only matrix–vector products. Parentheses matter (for performance) in matrix expressions!
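A minimal sketch of the difference, with made-up data of the shapes implied above (the coefficient names here are mine, just for illustration):

using LinearAlgebra
x = rand(1000, 10); y = rand(1000);               # example data, assumed shapes
β_naive = inv(x' * x) * x' * y                    # explicit inverse + matrix–matrix product
β_chol  = cholesky(Hermitian(x' * x)) \ (x' * y)  # Cholesky solve + matrix–vector products
β_naive ≈ β_chol                                  # same answer, up to roundoff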
However, for large problems the cost of the normal-equations approach is usually dominated by the cost of computing A^T A: for an m \times n matrix with m \gg n, forming A^T A takes on the order of mn^2 operations, versus only about n^3/3 for the subsequent Cholesky factorization.
As you noticed, the first step is to pre-allocate matrices for intermediate computations: use mul! for pre-allocated multiplications, and ldiv! for a pre-allocated backslash. Don’t use inv; as noted above, use a factorization object with pre-allocated storage instead.
For example, in the normal-equations approach, we can use cholesky! to do everything with zero allocations (after pre-allocation):
using LinearAlgebra, BenchmarkTools

m, n = 1000, 10
A = rand(m, n); b = rand(m); # input data
# preallocate outputs and temporaries:
AᵀA = similar(A, n, n);
Aᵀb = similar(b, n);
x = similar(b, n);
which gives:
julia> @btime ldiv!($x, cholesky!(Hermitian(mul!($AᵀA, $A', $A))),
mul!($Aᵀb, $A', $b));
14.026 μs (0 allocations: 0 bytes)
julia> x ≈ A \ b
true
As mentioned in the link above, you should be cautious about using the normal equations if A might be badly conditioned, since forming A^T A squares the condition number; that’s why it’s not the default for A \ b.
Doing the QR approach in-place is harder. There is a qr! function that can exploit pre-allocated output, but it’s not completely allocation-free.
The reason for this is that LAPACK’s QR algorithm requires additional working storage beyond the storage for the output factorization, and Julia’s high-level interface currently provides no way to pre-allocate that additional storage. The only way to do it completely pre-allocated would be to call the low-level LAPACK interface.
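For completeness, here is a rough sketch of that lower-level route using the LinearAlgebra.LAPACK wrappers (the helper name lls_qr! is mine; note too that these wrappers still allocate their own small LAPACK work buffers internally, so a fully allocation-free version would need direct ccalls with pre-allocated workspace):

using LinearAlgebra

# Sketch: solve min ‖Ax - b‖ via Householder QR; overwrites A and b,
# so pass scratch copies if you need the originals afterwards.
function lls_qr!(x, A, b, τ)
    LAPACK.geqrf!(A, τ)                 # A ← packed QR factors, τ ← reflector scalars
    LAPACK.ormqr!('L', 'T', A, τ, b)    # b ← Qᵀb
    n = size(A, 2)
    copyto!(x, 1, b, 1, n)              # x ← first n entries of Qᵀb
    LAPACK.trtrs!('U', 'N', 'N', view(A, 1:n, 1:n), x)  # solve R x = (Qᵀb)[1:n]
    return x
end

τ = similar(b, min(m, n));  # preallocated reflector coefficients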
However, unless you are solving rather tiny least-squares problems, the allocations are usually not a dominant factor in performance anyway. For example, if you skip pre-allocating the qr output in the example above, the time increases by less than 10%.
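That comparison might look like the following sketch (timing output omitted here; Atmp is just a scratch copy so that qr! does not destroy A):

Atmp = similar(A);
@btime ldiv!($x, qr!(copyto!($Atmp, $A)), $b);  # still allocates the factorization internals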
Thank you very much for this great and detailed explanation! And thanks for pointing out the “wrong” parentheses. Below are the updated benchmarks for these cases: