The interface in `LinearAlgebra.LAPACK` is really pleasant (if you can figure out which routines you want…), but the performance is not what I would have hoped for:
```julia
using LinearAlgebra

function computeQ!(A)
    m, n = size(A)
    k = min(m, n)
    tau = zeros(eltype(A), k)
    LAPACK.geqrf!(A, tau)     # factor: A now holds R and the Householder reflectors
    LAPACK.orgqr!(A, tau, k)  # overwrite A with the explicit (thin) Q
    return A
end

@show Threads.nthreads()
@show BLAS.get_num_threads()

m = 2_000_000
n = 400
A = rand(m, n)
B = deepcopy(A)

computeQ!(A)                  # warm-up run (triggers compilation)
@show norm(A'*A - LinearAlgebra.I)

A .= B                        # restore A before timing
@time computeQ!(A)
```
gives
```
Threads.nthreads() = 4
BLAS.get_num_threads() = 4
norm(A' * A - LinearAlgebra.I) = 4.035802930011282e-14
110.681526 seconds (5 allocations: 203.500 KiB)
```
This is on my slower, older desktop, compared to the faster 6-core machine on which I ran the blocked MGS-like algorithm yesterday. But still, that's a little disappointing. I timed the two calls separately and the time is split about evenly between `geqrf!` and `orgqr!`, which is not surprising; `orgqr!` applies the same reflectors using the same routines. The standard Julia `qr!`, which calls `geqrt!`, runs in about 22 seconds on this problem on my system, so it's about twice as fast as `geqrf!`.
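For reference, the per-stage split is easy to measure by calling the two routines directly; a quick sketch, using a much smaller matrix than the 2_000_000×400 problem above just to illustrate the measurement:

```julia
using LinearAlgebra

# Time the two LAPACK stages separately (small size for a quick check).
A = rand(10_000, 400)
tau = zeros(eltype(A), size(A, 2))
@time LAPACK.geqrf!(A, tau)               # factorization: R + reflectors
@time LAPACK.orgqr!(A, tau, length(tau))  # accumulation of the explicit Q
@show norm(A' * A - I)                    # should be ~eps-level
```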
The main difference is that `geqrt` is recursive and stores the triangular matrices T used to represent blocks of reflectors as Q = I − V T Vᵀ, while `geqrf` generates the T factors on the fly from the individual reflectors. I think `geqrt` is considered the more state-of-the-art algorithm. But it doesn't seem to have an equivalent of `orgqr!`, or at least I didn't find one poking around LAPACK. There is a routine `gemqrt` for multiplication, which is what `lmul!` uses. The scalars tau needed by `orgqr!` should be stored on the diagonals of the T blocks kept in the struct returned by `qr!`. You could perhaps pull those out and combine `qr!` with `orgqr!`. But constructing Q would definitely be the slower part of that computation.
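A sketch of that idea, under the assumption (which matches the documented `geqrt` storage format) that the diagonal of each nb×nb block of the compact-WY T factor holds the corresponding tau; the name `explicitQ!` is mine:

```julia
using LinearAlgebra

# Sketch: run the fast geqrt!-based qr!, recover tau from the block
# diagonals of the compact-WY T factor, then form Q with orgqr!.
function explicitQ!(A)
    m, n = size(A)
    k = min(m, n)
    F = qr!(A)                # QRCompactWY; calls geqrt! internally
    nb = size(F.T, 1)         # block size used by geqrt
    # column j belongs to block cld(j, nb); its tau sits on that
    # block's diagonal, i.e. at row mod1(j, nb) of column j of T
    tau = [F.T[mod1(j, nb), j] for j in 1:k]
    LAPACK.orgqr!(F.factors, tau, k)
    return F.factors          # A, overwritten with the thin Q
end
```

Whether this actually ends up faster than `geqrf!` + `orgqr!` would depend on how much of the total time `orgqr!` dominates.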
I really am surprised there isn't something analogous to `orgqr` made to work with the output of `geqrt`. If it exists, it seems like it should be named `orgqrt`, but I don't see that anywhere.
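In the absence of an `orgqrt`, one workaround is to apply the compact-WY Q to a thin identity via `lmul!`, which dispatches to `gemqrt` under the hood; a minimal sketch (the name `computeQ_lmul!` is mine):

```julia
using LinearAlgebra

# Materialize the thin Q from the geqrt!-based factorization by
# multiplying Q onto an m×k identity with lmul! (gemqrt internally).
function computeQ_lmul!(A)
    m, n = size(A)
    k = min(m, n)
    F = qr!(A)
    Q = lmul!(F.Q, Matrix{eltype(A)}(I, m, k))
    return Q
end
```

This allocates a fresh m×k matrix for Q rather than overwriting A in place, so it trades memory for being able to stay on the `geqrt` code path.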