In-place computation of Q (from QR decomposition)

That is a lovely idea! Quite a bit faster!

I want to accomplish this:

q = qr!(A)
A .= Matrix(q.Q)

I apologize for being dense, but I fail to see how your example accomplishes that.

That call to lmul! should return the orthogonalized matrix. (And does when I run it.) Is it the part about doing it in-place in A that you are worried about? That seems problematic, given that A is being used to store the Householder vectors in q.Q. I don't know a way to completely avoid allocating a new matrix to store the non-Householder representation of Q.
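
For example, a quick check that qr! really reuses A's memory (this relies on the factors field of the returned object, so treat it as illustrative only):

using LinearAlgebra
A = rand(6, 3)
q = qr!(A)
q.factors === A   # true: R sits in A's upper triangle, the Householder vectors below it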

Edit: Digging a little, LAPACK has a routine DORGQR that does exactly this for a QR factorization computed by dgeqrf, using an additional workspace array of size NB*N, where NB is the block size. It's even available in Julia at a low level:

julia> using LinearAlgebra

help?> LAPACK.orgqr!
  orgqr!(A, tau, k = length(tau))

  Explicitly finds the matrix Q of a QR factorization after calling geqrf! on A. Uses the output of geqrf!. A is overwritten by Q.

It looks like what qr! does, however, is call the alternate algorithm dgeqrt. I don't see a fully in-place routine for constructing Q in that case. This should be workable, but maybe requires low-level calls to LAPACK for dgeqrf and dorgqr. I don't see anything higher level provided by LinearAlgebra, anyway, that would facilitate the use of dorgqr.


I guess by then writing A .= lmul!(q.Q, Matrix{eltype(A)}(I, size(A))). Note that this is a two-step process: first, Matrix{eltype(A)}(I, size(A)) allocates a new matrix to hold the result, initialized as the first columns of the identity matrix. Then lmul! overwrites that matrix with Q by applying the reflectors, and finally the broadcast assignment copies the result back into A.
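
Spelled out step by step, the one-liner does this (same allocations, just in separate statements):

using LinearAlgebra
q = qr!(A)                          # A now holds the compact factorization
W = Matrix{eltype(A)}(I, size(A))   # allocate an m×n matrix, the leading columns of I
lmul!(q.Q, W)                       # overwrite W with the explicit Q by applying the reflectors
A .= W                              # copy the result back into A's memory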

Not sure what you are after, but with LAPACK QR you will not manage to make the columns of A orthogonal without allocating an intermediate result. The same happens with A .= Matrix(q.Q): the call Matrix(q.Q) first allocates the result, which is afterwards copied back into the memory of A.

This is actually slower for me.

@time q = qr!(A)
@time A .= Matrix(q.Q)
@time A .= lmul!(q.Q, Matrix{eltype(A)}(I, size(A)))

gives

 30.569143 seconds (5 allocations: 562.625 KiB)                                                        
 51.667263 seconds (9 allocations: 5.968 GiB, 5.91% gc time)                                           
 64.343821 seconds (8 allocations: 2.988 GiB)      

One advantage seems to be that it allocates about half as much as A .= Matrix(q.Q).

On Windows 10, single thread.

Edit: It is weird that there is a difference at all. Matrix(q.Q) seems to be implemented with lmul! itself!
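
For reference, materializing the thin Q is conceptually something like the following (a sketch, not the exact LinearAlgebra source):

using LinearAlgebra
# build the thin Q by applying the reflectors to the leading columns of I
materializeQ(q) = lmul!(q.Q, Matrix{eltype(q.factors)}(I, size(q.factors, 1), min(size(q.factors)...)))

which is why one would expect it to cost the same as the explicit lmul! call.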

Great find! I will pursue this a bit, since the cost of materializing Q is so huge.

This is a good question. On my test examples I saw very reasonable numerical stability. But, as you say, it is some form of hybrid algorithm.

I am also suspicious that the timings are so different. As for the allocation, it seems that this was fixed only recently:


The interface in LinearAlgebra.LAPACK is really pleasant (if you can figure out which routines you want…), but the performance is not what I would have hoped for:

using LinearAlgebra

function computeQ!(A)
  m, n = size(A)
  k = min(m, n)
  tau = zeros(eltype(A), k)  # scalar factors of the elementary reflectors
  LAPACK.geqrf!(A, tau)      # factor in place: R above the diagonal, Householder vectors below
  LAPACK.orgqr!(A, tau, k)   # overwrite A with the explicit (thin) Q
  return A
end

@show Threads.nthreads()
@show BLAS.get_num_threads()

m = 2_000_000
n = 400
A = rand(m, n)
B = deepcopy(A)
A .= B
computeQ!(A)
@show norm(A'*A - LinearAlgebra.I)
A .= B
@time computeQ!(A)

gives

Threads.nthreads() = 4
BLAS.get_num_threads() = 4
norm(A' * A - LinearAlgebra.I) = 4.035802930011282e-14
110.681526 seconds (5 allocations: 203.500 KiB)

This is on my slower older desktop compared to the faster 6-core machine on which I ran the blocked MGS-like algorithm yesterday. But still, that's a little disappointing. I separated them and the time is about evenly split between geqrf! and orgqr!, which is not surprising; orgqr! applies the same reflectors using the same routines. The standard Julia qr!, which calls geqrt!, runs in about 22 seconds on this problem on my system, so it's about twice as fast as geqrf!.

The main difference is that geqrt is recursive and it stores the matrices T used to represent blocks of reflectors as I - V T V^T, whereas geqrf generates the T on the fly from the individual reflectors. I think geqrt is considered the more state-of-the-art algorithm. But it doesn't seem to have an equivalent to orgqr!, or at least I didn't find it poking around LAPACK. There is a routine gemqrt for multiplication that is used by lmul!. I think the vector tau needed by orgqr! should be stored on the diagonals of the matrices T stored in the struct returned by qr!. You could perhaps pull those out and combine qr! with orgqr!. But constructing Q would definitely be the slower part of that computation.
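
To illustrate the compact WY representation and the claim about tau living on the diagonal of T, here is a small sketch; it pokes at the internal fields factors and T of the object returned by qr!, so treat it as illustrative rather than guaranteed API:

using LinearAlgebra
m, n = 8, 4                                  # n below the default block size, so a single block
F = qr!(rand(m, n))
V = tril(F.factors, -1) + Matrix(I, m, n)    # unit lower trapezoidal V
T = F.T[1:n, 1:n]                            # upper triangular T of the single block
Qfull = Matrix(I, m, m) - V * T * V'         # compact WY: Q = I - V*T*V'
@show opnorm(Qfull[:, 1:n] - Matrix(F.Q))    # ≈ machine epsilon
tau = [T[j, j] for j = 1:n]                  # tau sits on the diagonal of T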

I really am surprised there isn't something analogous to orgqr made to work with the output of geqrt. If it exists, it seems like it should be named orgqrt, but I don't see that anywhere.

Could you try it with MKL.jl?
Assuming you are using the default OpenBLAS: it is not known for optimized LAPACK routines.

To provide a bit of context: the package SubSIt.jl implements the subspace iteration algorithm for the solution of the generalized eigenvalue problem of free vibration (positive semi-definite pencil of sparse symmetric matrices, smallest-magnitude eigenvalues, size of matrices many times larger than the number of eigenvalues, but the number of eigenvalues can go into the thousands). The original algorithm has been tweaked in a couple of ways, one of which is to orthonormalize the current guess of the subspace (it appears to significantly improve convergence). I also have a suspicion (no theory, as yet) that the subspace only needs to be close to orthonormal, and does not need to be perfectly orthonormal.

It is really necessary to materialize the matrix of the orthonormal basis, hence my struggle with qr!: it needs to be fast, and it needs to avoid allocations (the matrix of the basis can be tremendously large).
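
To make the role of the orthonormalization concrete, here is a minimal sketch of one subspace-iteration step; the names (K, M, Kfact, X, subspace_step!) are hypothetical, and this is not SubSIt.jl's actual code. It assumes Kfact is a factorization of the stiffness matrix K (say from cholesky), M is the mass matrix, and X is the tall m×p subspace guess:

using LinearAlgebra

function subspace_step!(X, Kfact, M)
    X .= Kfact \ (M * X)   # inverse iteration applied to the whole block of vectors
    computeQ!(X)           # orthonormalize the current subspace guess in place
    return X               # a Rayleigh–Ritz step on (X'*K*X, X'*M*X) would follow
end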

At this point I have implemented the very neat algorithm of @mstewart (many thanks!), with minor improvements, and SubSIt can now regularly beat Arpack.

Yeah. I was just going with OpenBLAS. I split up some of the timings and the code looks like:

using LinearAlgebra
using MKL

@show Threads.nthreads()
@show BLAS.get_num_threads()
@show BLAS.get_config()

m = 2_000_000
n = 400
A = rand(m, n)
tau = zeros(Float64, n)
B = deepcopy(A)

A .= B
@time LAPACK.geqrf!(A, tau)
@time LAPACK.orgqr!(A, tau, n)

A .= B
@time qr!(A)

For OpenBLAS, commenting out the using MKL, I get:

Threads.nthreads() = 4
BLAS.get_num_threads() = 4
BLAS.get_config() = LBTConfig([ILP64] libopenblas64_.so)
 55.597908 seconds (58.73 k allocations: 3.234 MiB, 0.06% compilation time)
 55.310785 seconds (96.04 k allocations: 4.970 MiB, 0.11% compilation time)
 22.972657 seconds (75.61 k allocations: 4.162 MiB, 0.12% compilation time)

That was about what I was seeing before. With MKL it is:

Threads.nthreads() = 4
BLAS.get_num_threads() = 4
BLAS.get_config() = LBTConfig([ILP64] libmkl_rt.so, [LP64] libmkl_rt.so)
 36.772481 seconds (58.73 k allocations: 4.383 MiB, 0.09% compilation time)
 20.155890 seconds (96.04 k allocations: 737.587 MiB, 0.30% compilation time)
 24.449802 seconds (75.61 k allocations: 4.162 MiB, 0.55% compilation time)

So itā€™s a decent gain for geqrf!, a huge win for building Q using orgqr!, and essentially a tie for qr!.


Cool! I'm glad it was useful. I hope the level of orthogonality continues to be adequate. From your matrix dimensions I had guessed there was probably a large sparse eigenvalue problem behind it all. That 2,000,000 × 400 test was really tough on the 16 GB machines I was trying it out on. I felt the pain of unnecessary allocations.


If you do run into numerical trouble, there is this work on block MGS algorithms that seem to have numerical properties similar to MGS. They look a bit more involved than my suggestion.


Splendid! This is the solution. Many thanks!

If you want to go in the direction of using LAPACK semi-directly then you can try:

function computeQT!(A; nb = 32)
  m, n = size(A)
  k = min(m, n)
  T = zeros(eltype(A), nb, k)   # workspace for the block-reflector matrices T
  tau = zeros(eltype(A), k)
  LAPACK.geqrt!(A, T)           # blocked QR; fills A in place and T
  b = Int(ceil(k / nb))         # number of blocks
  ib = k - (b - 1) * nb         # size of the last (possibly partial) block
  # geqrt stores tau on the diagonal of each block of T; pull it out for orgqr!
  for l = 1:(b - 1)
    for j = 1:nb
      tau[(l - 1) * nb + j] = T[j, (l - 1) * nb + j]
    end
  end
  for j = 1:ib
    tau[(b - 1) * nb + j] = T[j, (b - 1) * nb + j]
  end
  LAPACK.orgqr!(A, tau, k)      # overwrite A with the explicit (thin) Q
  return A
end

Benchmarking three versions with MKL:

@show Threads.nthreads()
@show BLAS.get_num_threads()
@show BLAS.get_config()
m = 2_000_000
n = 400
A = rand(m, n)
B = deepcopy(A)
A .= B
@time computeQT!(A)
@show norm(A'*A - LinearAlgebra.I)
A .= B
@time computeQ!(A)
@show norm(A'*A - LinearAlgebra.I)
A .= B
@time mgsortho3!(A)
@show norm(A'*A - LinearAlgebra.I)

gave

Threads.nthreads() = 6
BLAS.get_num_threads() = 6
BLAS.get_config() = LBTConfig([ILP64] libmkl_rt.so, [LP64] libmkl_rt.so)
 19.251219 seconds (204.59 k allocations: 743.491 MiB, 0.38% compilation time)
norm(A' * A - LinearAlgebra.I) = 4.9510605234084604e-14
 28.385975 seconds (33.59 k allocations: 734.794 MiB, 0.06% compilation time)
norm(A' * A - LinearAlgebra.I) = 5.3692519363244134e-14
 20.521297 seconds (7.44 M allocations: 324.980 MiB, 0.40% gc time, 5.91% compilation time)
norm(A' * A - LinearAlgebra.I) = 4.348223673677305e-13

It also seems like I'm seeing perhaps a little better orthogonality for the Householder approaches. After looking at what's out there for blocked Gram-Schmidt, I think mgsortho3! is probably a little too naive and lacking in theory to really trust. For computeQT! it may look like I'm depending on weird LAPACK internals, but the algorithm and storage format that guarantee the elements of tau are where I pulled them out of T are part of the documentation of the LAPACK (d|z|s|c)geqrt routines called by geqrt!. So it shouldn't be something that changes on you.
