Economy-sized QR decomposition

qrebjock · March 20, 2021, 3:14pm

Hello,

It seems like the function qr doesn’t take keywords argument to return an economy-sized factorization.
In this thread qr(A; full=false) no longer available? · Issue #27397 · JuliaLang/julia · GitHub they say that the internal representation is efficient so it shouldn’t be a problem. But when it comes to do matrix multiplications with the Q factor I get weird performances.

Here is an example:

function fun1()
    A = randn(10000, 500)
    B = randn(10000, 500)
    Q, R = qr(A)
    return Q * B
end

function fun2()
    A = randn(10000, 500)
    B = randn(500, 500)
    Q, R = qr(A)
    return Q * B
end

The function fun2 should use the thin QR when evaluating the product Q * B.
But when we time the functions

using BenchmarkTools
@btime fun1();
@btime fun2();

we find that fun2 is as slow as fun1, or even slower because there are more allocations.

So my question would be: how to take advantage of the thin QR decomposition to get better performances?

Side note: If I try to do the multiplication with Q[:, 1:500] * B it is even slower (actually it takes forever – not sure what’s happening here).

baggepinnen · March 20, 2021, 3:33pm

What is the typeof(Q) ? Maybe you need to do Matrix(Q) to get an ordinary matrix which is fast to multiply by. The Q returned by qr might be lazy.

Possibly related
https://github.com/JuliaLang/julia/issues/38972
https://github.com/JuliaLang/julia/issues/37102

ettersi · March 21, 2021, 2:50am

Q*Bthin internally zero-extends the n x n matrix Bthin to an m x n matrix Bfull and then computes Q*Bfull, see

github.com

JuliaLang/julia/blob/6913f9ccb156230f54830c8b7b8ef82b41cac27e/stdlib/LinearAlgebra/src/qr.jl#L605-L616


      
          function (*)(A::AbstractQ, B::StridedMatrix)
              TAB = promote_type(eltype(A), eltype(B))
              Anew = convert(AbstractMatrix{TAB}, A)
              if size(A.factors, 1) == size(B, 1)
                  Bnew = copy_oftype(B, TAB)
              elseif size(A.factors, 2) == size(B, 1)
                  Bnew = [B; zeros(TAB, size(A.factors, 1) - size(B,1), size(B, 2))]
              else
                  throw(DimensionMismatch("first dimension of matrix must have size either $(size(A.factors, 1)) or $(size(A.factors, 2))"))
              end
              lmul!(Anew, Bnew)
          end

That’s why you do not get any speedup in your fun2().

According to the documentation for qr(), you can get the thin Q factor as a dense matrix through Matrix(Q), so Matrix(Q)*Bthin avoids the zero-padding, but it seems that the dense matrix algebra is less efficient than working with the compressed Householder representation:

julia> A = rand(10000,500)
       Bfull = rand(10000,500)
       Bthin = rand(500,500)
       Q,R = qr(A)

       @time Q*Bfull
       @time Matrix(Q)*Bthin
       ;
  0.184003 seconds (4 allocations: 40.054 MiB)
  0.300148 seconds (8 allocations: 116.349 MiB, 5.89% gc time)

qrebjock · March 21, 2021, 8:59pm

Thank you for your replies.

The type of Q is LinearAlgebra.QRCompactWYQ{Float64,Array{Float64,2}}. And indeed, Matrix(Q) returns a 10000 x 500 matrix. I understand better the internal representations now.

I have the same results as you, ettersi. When I perform the operation Matrix(Q) * Bthin it is as slow, or even slower than Q * Bfull.
Do you think it is something that we can hope to improve? That is, is it an issue with the way Julia builds Matrix(Q)? Or there is some inherent limitation of the QR decomposition that makes the thin decomposition not much faster than the full one?

ettersi · March 22, 2021, 1:17am

On second thought, it is not too hard to see why Matrix(Q) * Bthin is slower than Q * Bthin: the QRCompactWYQ data type represents Q as a sequence of Householder reflectors,

Q = \prod_{k = 1}^n (I - 2 \, u_k u_k^T) .

Therefore:

Q * Bthin = Q * [Bthin; zeros(m-n,n)] requires O(m*n) operations.
Matrix(Q) * Bthin requires O(m*n^2) operations.

This probably also explains why Julia (or LAPACK for that matter) does not provide an “optimised” code for Q*Bthin. The runtime of this operation would still be O(m*n), i.e. it would have the same asymptotic complexity as Q*Bpadded and “only” a slightly better prefactor. Specifically, I believe Q*Bpadded requires roughly 5*m*n additions and multiplications while an optimal implementation of Q*Bthin would require (2m + 3n)*n additions and multiplications.

qrebjock · March 22, 2021, 9:02am

I see, thank you that’s very helpful!

It seems like that computing the product Q * Bthin is already mostly optimized.

I am just wondering: is the zero padding allocating memory? Or it is something performed lazily and internally Julia knows that there is just a bunch of zeros?

ettersi · March 22, 2021, 3:41pm

No, the zero padding allocates memory.

qrebjock · March 22, 2021, 4:58pm

Ok, I think this explains why it doesn’t scale linearly in m (considering A is m x n). It’s slightly more than linear.

When I am trying the same code on MATLAB it does scale linearly if I use the thin QR decomposition.

Is there any chance to rewrite things in a way that would be linear? Maybe with low-level LAPACK functions? I am looking into it but it’s not so clear.

ettersi · March 25, 2021, 12:52am

I suspect that Julia failing to be linear in m is actually a sign of how optimised the code is. Specifically, I suspect that the nonlinearity is due to slower memory accesses if you operate on a larger amount of data. If so, the only way to achieve linear scaling would be to slow down the fast processing of small amounts of data which of course would make no sense.

This is just a guess. We might be able to come to more meaningful conclusions if you share the hard data which lead to your conclusion that Matlab scales linearly but Julia doesn’t.

qrebjock · April 5, 2021, 4:46pm

Ah I see what you mean. Yes it wouldn’t make sense. Actually I benchmarked a little more rigorously and here are the results I get.

In Julia:

using LinearAlgebra
using Plots

function fun(A, B)
    Q = qr(A).Q
    C = Q * B
end

function benchmark(n, sizes, num_repeats)
    num_sizes = length(sizes)
    times = zeros(num_sizes)
    for i = 1:num_sizes
        s = 0
        for repeat = 1:num_repeats
            A = randn(sizes[i], n)
            B = randn(n, n)
            s += (@timed fun(A, B)).time
        end
        times[i] = s / num_repeats
    end
    return times
end

n = 500
num_repeats = 5
num_sizes = 20
sizes = Int.(floor.(exp10.(range(3, stop=4, length=num_sizes))))

times = benchmark(n, sizes, num_repeats)
plot(sizes, times, xaxis=:log, yaxis=:log, label="")
plot!(sizes, sizes ./ 1e5, xaxis=:log, yaxis=:log, label="")

The plot I get:

In MATLAB:

n = 500;
num_repeats = 5;
num_sizes = 20;
sizes = floor(logspace(3, 4, num_sizes));
times = zeros(num_sizes, 1);

for i = 1:num_sizes
    disp(sizes(i))
    s = 0;
    for repeat = 1:num_repeats
        A = randn(sizes(i), n);
        B = randn(n, n);
        tic
        Q = qr(A, 0);
        C = Q * B;
        s = s + toc;
    end
    times(i) = s / num_repeats;
end

hold on
set(gca, 'XScale', 'log', 'YScale', 'log')
loglog(sizes, times)
loglog(sizes, sizes / 1e5)

The plot I get:

As you see both scale more or less linearly. But MATLAB is much faster.
There are maybe a few things to change in my code for the benchmark to be more fair. Let me know if it’s the case.

baggepinnen · April 6, 2021, 6:34am

What do you get from

julia> LinearAlgebra.BLAS.vendor()
:openblas64

?
Matlab uses MKL as BLAS, which is usually much faster than openblas. You can use MKL by installing MKL.jl

Topic		Replies	Views
Copying a QR decomposition matrix Q is extremely slow Performance question , linearalgebra	7	960	November 22, 2020
QR Decomposition Extremely Slow General Usage linearalgebra	4	1379	October 30, 2018
Why is work with Q in qr() is so expensive? Performance bug , linearalgebra , qr	6	641	April 13, 2021
Extract submatrix from QR factor is slow Julia 1.4 General Usage	3	676	March 27, 2020
Usage of QR Factors General Usage linearalgebra	12	1990	August 26, 2019

Economy-sized QR decomposition

Related topics