That is a lovely idea! Quite a bit faster!
I want to accomplish this:
q = qr!(A)
A .= Matrix(q.Q)
I apologize for being dense, but I fail to see how your example accomplishes that.
That call to lmul!
should return the orthogonalized matrix. (And does when I run it.) Is it the part about doing it in-place in A
that you are worried about? That seems problematic, given that A
is being used to store the Householder vectors in q.Q
. I donāt know a way to completely avoid allocating a new matrix to store the non-Householder representation of Q.
Edit: Digging a little, LAPACK has a routine DORGQR that does exactly this for a QR factorization computed from of dgeqrf
using an additional workspace array of size NB*N
where NB
is the block size. Itās even available in Julia at a low level:
julia> using LinearAlgebra
help?> LAPACK.orgqr!
orgqr!(A, tau, k = length(tau))
Explicitly finds the matrix Q of a QR factorization after calling geqrf! on A. Uses the output of geqrf!. A is overwritten by Q.
It looks like what qr!
does, however, is call the alternate algorithm dgeqrt
. I donāt see a fully in-place routine for constructing Q
in that case. This should be workable, but maybe requires low level calls to LAPACK for dgeqrf
and dorgqr
. I donāt see anything higher level provided by LinearAlgebra
, anyway, that would facilitate the use of dorgqr
.
? I guess by then writing A .= lmul!(q.Q, Matrix{eltype(A)}(I, size(A)))
. Note that this is a two step process; first Matrix{eltype(A)}(I, size(A))
allocates a new matrix to store the result, in particular by initializing it as the first columns of the identity matrix. Then the result is written in that place using lmul!(q.Q, Matrix{eltype(A)}(I, size(A)))
. Finally, the contents of that result is copied back into A
using A .= lmul!(q.Q, Matrix{eltype(A)}(I, size(A)))
Not sure what you are after, but with Lapack QR you will not manage to make the columns of A
orthogonal without allocating an intermediate result. The same happens with A .= Matrix(q.Q)
. The call Matrix(q.Q)
first allocates the result, which then afterwards is copied back into the memory of A
.
This is actually slower for me.
@time q = qr!(A)
@time A .= Matrix(q.Q)
@time A .= lmul!(q.Q, Matrix{eltype(A)}(I, size(A)))
gives
30.569143 seconds (5 allocations: 562.625 KiB)
51.667263 seconds (9 allocations: 5.968 GiB, 5.91% gc time)
64.343821 seconds (8 allocations: 2.988 GiB)
An advantage seems to be half the allocation of A .= Matrix(q.Q)
.
On Windows 10, single thread.
Edit: It is weird that there is a difference at all. Matrix(q.Q)
seems to be implemented with lmul!
!
Great find! I will pursue this a bit, since the cost of materializing Q is so huge.
This is a good question. On my test examples I saw very reasonable numerical stability. But, as you say, it is some form of hybrid algorithm.
I am also suspicious that the timings are so different. As for the allocation, it seems that this was fixed only recently:
The interface in LinearAlgebra.LAPACK
is really pleasant (if you can figure out which routines you wantā¦), but the performance is not what I would have hoped for:
using LinearAlgebra
function computeQ!(A)
m, n = size(A)
k = min(m, n)
tau = zeros(eltype(A), k)
LAPACK.geqrf!(A, tau)
LAPACK.orgqr!(A, tau, k)
return A
end
@show Threads.nthreads()
@show BLAS.get_num_threads()
m = 2_000_000
n = 400
A = rand(m, n)
B = deepcopy(A)
A .= B
computeQ!(A)
@show norm(A'*A - LinearAlgebra.I)
A .= B
@time computeQ!(A)
gives
Threads.nthreads() = 4
BLAS.get_num_threads() = 4
norm(A' * A - LinearAlgebra.I) = 4.035802930011282e-14
110.681526 seconds (5 allocations: 203.500 KiB)
This is on my slower older desktop compared to the faster 6 core machine on which I ran the blocked MGS-like algorithm yesterday. But still, thatās a little disappointing. I separated them and the time is about evenly split between geqrf!
and orgqr!
, which is not surprising; orgqr!
applys the same reflectors using the same routines. The standard Julia qr!
which calls geqrt!
runs in about 22 seconds on this problem on my system, so itās about twice as fast as geqrf!
.
The main difference is that geqrt
is recursive and it stores the matrices T used to represent blocks of reflectors as I - V T V^T. geqrf
generates the T on the fly from the individual reflectors. I think geqrt
is considered the more state-of-the-art algorithm. But it doesnāt seem to have an equivalent to orgqr!
, or at least I didnāt find it poking around LAPACK. There is a routine gemqrt
for multiplication that is used by lmul!
. I think the vector tau
needed by orgqr!
should be stored on the diagonals of the matrices T stored in the struct returned by qr!
. You could perhaps pull those out and combine qr!
with orgqr!
. But constructing Q would definitely be the slower part of that computation.
I really am surprised there isnāt something analogous to orgqr
made to work with the output of geqrt
. If it exists, it seems like it should be named orgqrt
, but I donāt see that anywhere.
Could you try it with MKL.jl
?
Assuming you use the default OpenBLAS
, it is not known for optimized LAPACK routines.
To provide a bit of context: the package SubSIt.jl implements the subspace iteration algorithm for the solution of the generalized eigenvalue problem of free vibration (positive semi-definite pencil of sparse symmetric matrices, smallest-magnitude eigenvalues, size of matrices many times larger than the number of eigenvalues, but number of eigenvalues can go into thousands). The original algorithm has been tweaked in a couple of ways, one of which is to orthonormalize the current guess of the subspace (it appears to is significantly improve convergence). I also have a suspicion (no theory, as yet) that the subspace only needs to be close to orthonormal, and does not need to be perfectly orthonormal.
It is really necessary to materialize the matrix of the orthonormal basis, hence my struggle with qr!
: it needs to be fast, and it needs to avoid allocations (the matrix of the basis can be tremendously large).
At this point I have implemented the very neat algorithm of @mstewart (many thanks!), with minor improvements, and SubSIt can now regularly beat Arpack.
Yeah. I was just going with OpenBLAS. I split up some of the timings and the code looks like:
using LinearAlgebra
using MKL
@show Threads.nthreads()
@show BLAS.get_num_threads()
@show BLAS.get_config()
m = 2_000_000
n = 400
A = rand(m, n)
tau = zeros(Float64, n)
B = deepcopy(A)
A .= B
@time LAPACK.geqrf!(A, tau)
@time LAPACK.orgqr!(A, tau, n)
A .= B
@time qr!(A)
For OpenBLAS, commenting out the using MKL
, I get:
Threads.nthreads() = 4
BLAS.get_num_threads() = 4
BLAS.get_config() = LBTConfig([ILP64] libopenblas64_.so)
55.597908 seconds (58.73 k allocations: 3.234 MiB, 0.06% compilation time)
55.310785 seconds (96.04 k allocations: 4.970 MiB, 0.11% compilation time)
22.972657 seconds (75.61 k allocations: 4.162 MiB, 0.12% compilation time)
That was about what I was seeing before. With MKL it is:
Threads.nthreads() = 4
BLAS.get_num_threads() = 4
BLAS.get_config() = LBTConfig([ILP64] libmkl_rt.so, [LP64] libmkl_rt.so)
36.772481 seconds (58.73 k allocations: 4.383 MiB, 0.09% compilation time)
20.155890 seconds (96.04 k allocations: 737.587 MiB, 0.30% compilation time)
24.449802 seconds (75.61 k allocations: 4.162 MiB, 0.55% compilation time)
So itās a decent gain for geqrf!
, a huge win for building Q
using orgqr!
, and essentially a tie for qr!
.
Cool! Iām glad it was useful. I hope the level of orthogonality continues to be adequate. From your matrix dimensions I had guessed there was probably a large sparse eigenvalue problem behind it all. That 2,000,000\times 400 test was really tough on the 16GB machines I was trying it out on. I felt the pain of unnecessary allocations.
If you do run into numerical trouble, there is this work on block MGS algorithms that seem to have numerical properties similar to MGS. They look a bit more involved than my suggestion.
Splendid! This is the solution. Many thanks!
If you want to go in the direction of using LAPACK semi-directly then you can try:
function computeQT!(A; nb = 32)
m, n = size(A)
k = min(m, n)
T = zeros(eltype(A), nb, k)
tau = zeros(eltype(A), k)
LAPACK.geqrt!(A, T)
b = Int(ceil(k / nb))
ib = k - (b - 1) * nb
for l = 1:(b - 1)
for j = 1:nb
tau[(l - 1) * nb + j] = T[j, (l - 1) * nb + j]
end
end
for j = 1:ib
tau[(b - 1) * nb + j] = T[j, (b - 1) * nb + j]
end
LAPACK.orgqr!(A, tau, k)
return A
end
Benchmarking three versions with MKL:
@show Threads.nthreads()
@show BLAS.get_num_threads()
@show BLAS.get_config()
m = 2_000_000
n = 400
A = rand(m, n)
B = deepcopy(A)
A .= B
@time computeQT!(A)
@show norm(A'*A - LinearAlgebra.I)
A .= B
@time computeQ!(A)
@show norm(A'*A - LinearAlgebra.I)
A .= B
@time mgsortho3!(A)
@show norm(A'*A - LinearAlgebra.I)
gave
Threads.nthreads() = 6
BLAS.get_num_threads() = 6
BLAS.get_config() = LBTConfig([ILP64] libmkl_rt.so, [LP64] libmkl_rt.so)
19.251219 seconds (204.59 k allocations: 743.491 MiB, 0.38% compilation time)
norm(A' * A - LinearAlgebra.I) = 4.9510605234084604e-14
28.385975 seconds (33.59 k allocations: 734.794 MiB, 0.06% compilation time)
norm(A' * A - LinearAlgebra.I) = 5.3692519363244134e-14
20.521297 seconds (7.44 M allocations: 324.980 MiB, 0.40% gc time, 5.91% compilation time)
norm(A' * A - LinearAlgebra.I) = 4.348223673677305e-13
4.348223673677305e-13
It also seems like Iām seeing perhaps a little better orthogonality for Housholder approaches. After looking at whatās out there for blocked Gram-Schmidt, I think mgsortho3!
is probably a little too naive and lacking in theory to really trust. For computeQT!
it may look like Iām depending on weird LAPACK internals, but the algorithm and storage format that guarantees that the elements of tau
are where I pulled them out of T
are part of the documentation of LAPACK (d|z|s|c)geqrt
that is called by geqrt!
. So it shouldnāt be something that changes on you.