Usage of QR Factors

I’d like to compute the QR factorisation Q, R = qr(A) and then form the product Matrix(Q)' * Y. As far as I can tell, Matrix(Q) allocates another matrix of the same size as A, which I’d like to avoid.

Is there a way to perform Matrix(Q)' * Y that doesn’t allocate Matrix(Q)? I can of course do (Q' * Y)[1:Ncols] where Ncols = size(A,2), but if Ncols << Nrows then the intermediate Q' * Y is much larger than it needs to be, so I would prefer to avoid this if possible.
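
To make that concrete, here is roughly what I mean (the names Ncols and Nrows are just for illustration):

using LinearAlgebra

A = rand(1_000, 10); Y = rand(1_000, 5)    # Nrows = 1_000, Ncols = 10
Q, R = qr(A)
Ncols = size(A, 2)

# Q' * Y applies the stored reflectors (no dense Q is formed), but the result
# is Nrows x size(Y, 2); only its first Ncols rows are the "thin" product.
thinQtY = (Q' * Y)[1:Ncols, :]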

Hi!

What’s wrong with q'a? That dispatches to gemqrt, which is efficient and doesn’t allocate.

If it is about allocations (not necessarily speed), why convert Q at all? Simply Q' * Y should work with fewer allocations.

julia> using LinearAlgebra

julia> A = rand(3,3);

julia> Y = rand(3,3);

julia> Q, R = qr(A);

julia> using BenchmarkTools

julia> @btime Matrix($Q)' * $Y;
  770.550 ns (3 allocations: 480 bytes)

julia> @btime $(Q)'*$Y;
  625.294 ns (2 allocations: 320 bytes)

Regarding speed, for 100x100 I get

julia> @btime Matrix($Q)' * $Y;
  4.179 ms (6 allocations: 234.61 KiB)

julia> @btime $(Q)'*$Y;
  12.421 ms (4 allocations: 156.41 KiB)

Yes, that is what I do - see the OP (with a typo, now corrected). I probably shouldn’t have worried about it; the question was partly out of curiosity, but partly motivated by the fact that my systems are on the order of 1M x 10K.

OK, I see now: q’a is N x M, not M x M as I was expecting. Strangely it seems LAPACK doesn’t even expose a routine for that…

I see - so it would be a matter of deciding on a representation for Q and implementing it in Julia. Probably not worth it. The runtime difference always seems to be less than a factor of 10 anyhow.

I just documented how Q is stored in memory: https://docs.julialang.org/en/latest/stdlib/LinearAlgebra/#LinearAlgebra.QRCompactWY (see the quick-and-dirty Julia implementation to recover Q here: https://github.com/JuliaLang/julia/pull/32996). Maybe you can write it in Julia if LAPACK does not have it? It sounds like something we need if we are to expose a thin rectangular Q: https://github.com/JuliaLang/julia/issues/26591
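
For illustration, here is a rough sketch (not the code from that PR) of recovering the thin Q by applying the stored reflectors to a thin identity, which is more or less what Matrix(Q) does:

using LinearAlgebra

A = rand(1_000, 10)
F = qr(A)
m, n = size(A)

# Q times an m x n identity gives the first n columns of the full Q,
# i.e. the thin Q, without ever forming the m x m matrix.
thinQ = F.Q * Matrix{Float64}(I, m, n)
thinQ ≈ Matrix(F.Q)    # should hold up to roundoff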


Thanks for this - that was useful.

I agree - this seems to be the only operation I couldn’t carry out directly with the type returned by qr.

You can also try QR (as opposed to QRCompactWY); I believe you can construct it via LinearAlgebra.qrfactUnblocked!. It has the benefit that the entries of F.factors below the diagonal, along with F.τ, tell you what the Householder reflectors are, so they are easy to use in a custom implementation.

This is what I use in BandedMatrices’ new implementation of QR; it was remarkably easy to get working.
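
For illustration (this is not the BandedMatrices code, just a rough sketch of how accessible the reflectors are), applying Q' to Y straight from factors and τ:

using LinearAlgebra

# Apply Q' to Y in place using the Householder vectors stored below the
# diagonal of F.factors and the coefficients in F.τ (H_1' is applied first).
function applyQt!(Y, F::LinearAlgebra.QR)
    factors, τ = F.factors, F.τ
    m = size(factors, 1)
    for j in 1:length(τ)
        v = [1; factors[j+1:m, j]]      # j-th reflector, implicit unit leading entry
        rows = j:m
        Y[rows, :] .-= conj(τ[j]) .* v .* (v' * Y[rows, :])
    end
    return Y
end

A = rand(100, 10); Y = rand(100, 3)
F = LinearAlgebra.qrfactUnblocked!(copy(A))
applyQt!(copy(Y), F) ≈ F.Q' * Y         # should hold up to roundoff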


Thanks for the suggestion. Do you know if there is any/much performance difference? I’m particularly worried about memory usage. My LSQ systems are at the limit of what my machine can handle.

QR is very slightly more memory-efficient than QRCompactWY, because QR stores a vector τ of length min(m,n) while QRCompactWY stores a matrix T of size min(m,n,36) x min(m,n) (36 is the default block size). But both structs carry a factors matrix of size m x n, so I’d say the difference is pretty negligible. IIUC, the benefit of QRCompactWY is performance, since you can use matrix-matrix multiplication as a building block of the algorithm rather than matrix-vector multiplication.
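
A quick way to see this (the exact shape of T may vary with Julia version and block size):

using LinearAlgebra

A = rand(1_000, 50)
Fwy = qr(A)                                     # QRCompactWY
Fqr = LinearAlgebra.qrfactUnblocked!(copy(A))   # QR

size(Fwy.factors), size(Fwy.T)     # (1000, 50) plus the small block-reflector T
size(Fqr.factors), length(Fqr.τ)   # (1000, 50) plus a length-50 τ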


Note that LinearAlgebra.qrfactUnblocked! calls slow, generic Julia code, so for performance you’ll want to call LAPACK directly:

fastQR!(A) = QR(LAPACK.geqrf!(A)...);
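
For example (sizes here are placeholders; note that geqrf! overwrites its argument, hence the copy):

A = rand(10_000, 1_000); Y = rand(10_000, 20)
F = fastQR!(copy(A))
F.Q' * Y      # applies the stored reflectors; no dense Q is formed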

As @tkf said, QRCompactWY supposedly takes more advantage of BLAS Level 3, but in a quick test on a 10_000 x 1000 matrix I didn’t see any performance difference between QR and QRCompactWY.

Maybe the default block size (currently 36) for QRCompactWY is too small? Increasing it to something like 100 did help a lot. On Julia master there is now a blocksize argument so you can increase it: https://github.com/JuliaLang/julia/pull/33053. Of course, you can call LAPACK.geqrt! directly too. I wonder if we should increase the default.
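
A rough sketch of both routes (100 here is just an arbitrary block size, not a tuned value, and the keyword only exists on recent Julia per the PR above):

using LinearAlgebra

A = rand(10_000, 1_000)

# On Julia master (per the PR above):
# F = qr(A; blocksize = 100)

# Or call LAPACK geqrt! directly with an explicit block size:
F = LinearAlgebra.QRCompactWY(LAPACK.geqrt!(copy(A), 100)...)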