I’d like to compute QR factorisation Q, R = qr(A) followed by the operation Matrix(Q)' * Y. As far as I can tell, Matrix(Q) allocates another matrix the same size as A which I’d like to avoid.

Is there a way to perform Matrix(Q)' * Y that doesn’t allocate the matrix Matrix(Q)? I can of course do (Q' * Y)[1:Ncols] where Ncols where Ncols = size(A,2), but if Ncols << Nrows then I would prefer to avoid this if possible.

Yes, that is what i do - see OP (with a typo, now corrected). I probably shouldn’t have worried about it, the question was partly out of curiosity, but partly motivated by the fact that my systems are on the order 1M x 10K.

I see - so it would be a matter of deciding the representation Q and implementing it in Julia. Probably not worth it. The runtime difference always seems to be less than O(10) anyhow.

You can also try QR (as opposed to QRCompactWY), I believe you can construct it via LinearAlgebra.qrfactUnblocked!. It has the benefit that that the subdiagonals of F.factors along with F.τ tell you what the Householder reflectors are, so are easy to use in a custom implementation.

This is what I use in BandedMatrices new implementation of QR, it was remarkably easy to get working.

Thanks for the suggestion. Do you know if there Is any/much performance difference ? Im particularly worries about memory usage. My LSQ systems are at the limit of what my system can handle.

QR is very slightly more memory-efficient than QRCompactWY, because QR stores a vector τ of length min(m,n) while QRCompactWY stores a matrix T of size min(m,n,36) x min(m,n) (36: default block size). But both structs have a matrix factors of size m x n so I’d say it is pretty negligible. IIUC the benefit of QRCompactWY is performance since you can use matrix-matrix multiplication as a building block of the algorithm rather than matrix-vector multiplication.

Note that LinearAlgebra.qrfactUnblocked! calls slow, generic Julia code, so you’ll need to call LAPACK directly:

fastQR!(A) = QR(LAPACK.geqrf!(A)...);

As @tkf said supposedly QRCompactWY takes advantage of BLAS Level 3 more, but in a quick test on a 10_000 x 1000 matrix I didn’t see any performance difference between QR and QRCompactWY.

Maybe the default block size (currently 36) for QRCompactWY is too small? Increasing it to something like 100 did help a lot. In Julia master it now has blocksize argument so you can increase it: https://github.com/JuliaLang/julia/pull/33053. Of course, you can call LAPACK.geqrt! directly too. I wonder if we should increase the default.