Extract submatrix from QR factor is slow Julia 1.4

Hello everyone,

I have a pretty large matrix A \sim 600 \times 200 on which I would like to perform a QR decomposition.

Performing the QR decomposition takes 10ms, that seems slow to me, is there a way to make it faster.

In the end, I only need to access a submatrix of Q and the matrix R. But forming the submatrix Q is as slow as performing the qr decomposition!!! I have tried with Matrix{Float64}(F.Q) and F.Q*Matrix(I, blabla, blabla) but both are very slow.

Here is a toy version of my code:

using BenchmarkTools
N = 200
nx = 597
λ = 0.1

ψ = rand(nx,N)
ψ_aug = zeros(N+nx, nx)

ψ_aug[1:N,:] .= ψ'
ψ_aug[N+1:N+nx,:] .= Matrix(sqrt(λ)*I, nx, nx)

@btime F = qr(ψ_aug)

Q1 = zeros(N, nx)
@btime Q1 .= view(F.Q*Matrix(I, N+nx, N+nx), 1:N, 1:nx)
@btime Q1 .= (F.Q*Matrix(I,N+nx, N+nx))[1:N, 1:nx]
@btime Q1 .= Matrix{Float64}(F.Q)[1:N, 1:nx];

ψbis  = rand(4, N) 

A = ψbis - (ψbis*Q1)*Q1';

Output of the code:

10.438 ms (7 allocations: 3.96 MiB)
12.054 ms (14 allocations: 9.08 MiB)
10.338 ms (15 allocations: 9.99 MiB)
8.910 ms (13 allocations: 10.89 MiB)

I welcome any other speed improvement on this code.

Why not just

julia> @btime Q1 = view(F.Q, 1:N, 1:nx);
  115.400 ns (4 allocations: 160 bytes)

For comparison, on my computer

julia> @btime Q1 .= Matrix{Float64}(F.Q)[1:N, 1:nx];
  34.454 ms (13 allocations: 10.89 MiB)

Unfortunatley, this solution only moves the problem since computing the A matrix (last line of the code) is now super expensive and takes 16s (timed with @btime).

Concerning the QR decomposition:
Can I set a partial qr decomposition?
Is there a qr routine optimized for multi-threading?

The Householder QR factorization (a variant of which is used by Julia’s qr function) does not compute Q explicitly — instead, it stores a sequence of “Householder reflector” vectors that can be used to compute Q*x (or Q'*x) for any vector x almost as quickly as if you had an explicit Q matrix. See, for example, these notes.

When you call Matrix(F.Q) to compute Q explicitly, it basically has to multiply the Householder reflectors by the columns of the identity matrix I. To get a rough estimate of the cost, you can look at the flop count (floating-point additions and multiplications), and indeed the flop count for computing Q explicitly is similar to the flop count for the QR factorization itself for square matrices, and in fact is larger if the original matrix is very “tall” (non-square)!

To get a faster approach that avoids forming a Q submatrix explicitly, you should realize that Q1 = Q[1:N, 1:nx] (where Q is N+nx × N+nx) is equivalent to

Q1 = Matrix(I, N, N+nx) * Q * Matrix(I, N+nx, nx) = P1 * Q * P2

where P1 and P2 are very sparse. The trick is to exploit this sparsity, and multiply the P matrices by your other vectors rather than Q. You eventually want to compute (ψbis*Q1)*Q1'. This becomes:

(ψbis*Q1)*Q1' == ((ψbis*P1) * Q) * (P2*P2') * Q' * P1'

The key is to realize that al these multiplications by P are equivalent to extracting a submatrix or padding with zeros, but we can do this after we multiply by Q or Q'. That is, we are doing:

f(ψbis, Q, N, nx) = ([([ψbis zeros(size(ψbis,1), nx)] * Q)[axes(ψbis,1), 1:nx] zeros(size(ψbis,1), N)] * Q')[axes(ψbis,1), 1:N]

On my machine,

@btime f($ψbis, $(F.Q), $N, $nx);

takes about 850µs and produces the same result as (ψbis*Q1)*Q1', whereas QR factorization took about 12ms.

Ideally, this would be automated. For that to happen, F.Q[range1, range2] would have to return a new kind of matrix-like object that, upon multiplication, implicitly pads its input with zeros and extracts range1 rows of the result (or vice versa for left-multiply or multiplication by the transpose). This is certainly possible to implement in Julia (in the same way that F.Q itself is a matrix-like object that implicitly multiplies by the Householder reflectors). A PR for this functionality might be of interest.

PS. To get the first nx columns of Q, a faster way is to multiply F.Q * Matrix(I, size(F.Q, 1), nx), but I don’t know of similarly easy and efficient way to get the first N rows of the first nx columns. (The algorithm for the latter in principle exists, but it would have to be implemented.)