Economy-sized QR decomposition

Hello,

It seems that the function qr doesn’t take a keyword argument to return an economy-sized factorization.
In the issue “qr(A; full=false) no longer available?” (JuliaLang/julia #27397), they say that the internal representation is efficient, so it shouldn’t be a problem. But when it comes to doing matrix multiplications with the Q factor, I get weird performance.

Here is an example:

function fun1()
    A = randn(10000, 500)
    B = randn(10000, 500)
    Q, R = qr(A)
    return Q * B
end

function fun2()
    A = randn(10000, 500)
    B = randn(500, 500)
    Q, R = qr(A)
    return Q * B
end

The function fun2 should use the thin QR when evaluating the product Q * B.
But when we time the functions

using BenchmarkTools
@btime fun1();
@btime fun2();

we find that fun2 is as slow as fun1, or even slower because there are more allocations.

So my question is: how can I take advantage of the thin QR decomposition to get better performance?

Side note: If I try to do the multiplication with Q[:, 1:500] * B it is even slower (actually it takes forever – not sure what’s happening here).

What is the typeof(Q)? Maybe you need to do Matrix(Q) to get an ordinary matrix, which is fast to multiply by. The Q returned by qr might be lazy.
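A quick way to check that guess (a minimal sketch; the sizes here are arbitrary):

```julia
using LinearAlgebra

A = randn(1000, 50)
Q = qr(A).Q          # not a plain Matrix: a lazy Householder representation
@show typeof(Q)
Qthin = Matrix(Q)    # materialize the thin factor
@show size(Qthin)    # (1000, 50)
```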

Possibly related
https://github.com/JuliaLang/julia/issues/38972
https://github.com/JuliaLang/julia/issues/37102


Q*Bthin internally zero-extends the n x n matrix Bthin to an m x n matrix Bfull and then computes Q*Bfull.

That’s why you do not get any speedup in your fun2().
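One can check this equivalence directly (a small sketch, with arbitrary sizes):

```julia
using LinearAlgebra

m, n = 1000, 50
A = randn(m, n)
Q = qr(A).Q
Bthin = randn(n, n)

# This padded matrix is what Q * Bthin allocates and multiplies internally:
Bpadded = [Bthin; zeros(m - n, n)]
@assert Q * Bthin ≈ Q * Bpadded
```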

According to the documentation for qr(), you can get the thin Q factor as a dense matrix through Matrix(Q), so Matrix(Q)*Bthin avoids the zero-padding. However, it seems that the dense matrix product is less efficient than working with the compressed Householder representation:

julia> A = rand(10000,500)
       Bfull = rand(10000,500)
       Bthin = rand(500,500)
       Q,R = qr(A)

       @time Q*Bfull
       @time Matrix(Q)*Bthin
       ;
  0.184003 seconds (4 allocations: 40.054 MiB)
  0.300148 seconds (8 allocations: 116.349 MiB, 5.89% gc time)

Thank you for your replies.

The type of Q is LinearAlgebra.QRCompactWYQ{Float64,Array{Float64,2}}. And indeed, Matrix(Q) returns a 10000 x 500 matrix. I understand the internal representations better now.

I get the same results as you, ettersi. When I perform the operation Matrix(Q) * Bthin, it is as slow as, or even slower than, Q * Bfull.
Do you think this is something we can hope to improve? That is, is it an issue with the way Julia builds Matrix(Q), or is there some inherent limitation of the QR decomposition that makes the thin decomposition not much faster than the full one?

On second thought, it is not too hard to see why Matrix(Q) * Bthin is slower than Q * Bthin: the QRCompactWYQ data type represents Q as a product of Householder reflectors,

Q = \prod_{k = 1}^n (I - 2 \, u_k u_k^T) .

Therefore:

  • Q * Bthin = Q * [Bthin; zeros(m-n,n)] applies n reflectors of length O(m) to each of the n columns of the padded matrix, i.e. O(m*n^2) operations.
  • Matrix(Q) * Bthin is a dense (m x n) * (n x n) product, which is also O(m*n^2) operations, on top of the O(m*n^2) it takes to form Matrix(Q) in the first place.

This probably also explains why Julia (or LAPACK for that matter) does not provide an “optimised” code for Q*Bthin. The runtime of this operation would still be O(m*n) per column, i.e. it would have the same asymptotic complexity as Q*Bpadded and “only” a slightly better prefactor. Specifically, I believe Q*Bpadded requires roughly 5*m*n additions and multiplications per column of B, while an optimal implementation of Q*Bthin would require roughly (2m + 3n)*n per column.
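The reflector representation can be verified numerically. A sketch using the LAPACK wrapper geqrf! (note that LAPACK stores a general scalar tau[k] in place of the factor 2 above, with each u_k normalized so that its k-th entry is 1):

```julia
using LinearAlgebra

m, n = 8, 3
A = randn(m, n)
AF, tau = LAPACK.geqrf!(copy(A))    # in-place QR: reflectors below the diagonal, R above
V = tril(AF, -1) + Matrix(I, m, n)  # reflector vectors, with the implicit 1 in row k

# Accumulate Q = H_1 * H_2 * ... * H_n, where H_k = I - tau[k] * v_k * v_k':
Qfull = Matrix{Float64}(I, m, m)
for k in n:-1:1
    v = V[:, k]
    Qfull .= (I - tau[k] * (v * v')) * Qfull
end

R = triu(AF[1:n, :])
@assert Qfull[:, 1:n] * R ≈ A       # thin Q times R reconstructs A
```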


I see, thank you that’s very helpful!

It seems, then, that computing the product Q * Bthin is already mostly optimized.

I am just wondering: does the zero padding allocate memory? Or is it performed lazily, with Julia knowing internally that there is just a bunch of zeros?

No, it is not lazy: the zero padding allocates memory.
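If that allocation matters (e.g. inside a loop), one workaround is to keep a preallocated padded buffer and apply Q to it in place with lmul!, which is defined for the lazy Q type. A sketch:

```julia
using LinearAlgebra

m, n = 1000, 50
A = randn(m, n)
Q = qr(A).Q
Bthin = randn(n, n)

# Preallocate the padded buffer once and reuse it across calls:
C = zeros(m, n)
C[1:n, :] .= Bthin
lmul!(Q, C)             # C <- Q * C, applied in place
@assert C ≈ Q * Bthin
```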

Ok, I think this explains why it doesn’t scale linearly in m (considering A is m x n). It’s slightly more than linear.

When I try the same code in MATLAB, it does scale linearly if I use the thin QR decomposition.

Is there any chance of rewriting things in a way that scales linearly? Maybe with low-level LAPACK functions? I am looking into it, but it’s not so clear.
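A sketch of the low-level route, using the geqrf!/ormqr! wrappers in LinearAlgebra.LAPACK (this applies the implicit Q without ever materializing it, though dormqr still requires the right-hand side to have m rows, so the padding remains):

```julia
using LinearAlgebra

m, n = 1000, 50
A = randn(m, n)
Bthin = randn(n, n)

AF, tau = LAPACK.geqrf!(copy(A))       # in-place QR: reflectors + R stored in AF
C = [Bthin; zeros(m - n, n)]           # ormqr expects m rows for side 'L'
LAPACK.ormqr!('L', 'N', AF, tau, C)    # C <- Q * C, with Q applied implicitly
@assert C ≈ qr(A).Q * Bthin
```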

I suspect that Julia failing to be linear in m is actually a sign of how optimised the code is. Specifically, I suspect the nonlinearity is due to slower memory accesses when operating on a larger amount of data. If so, the only way to achieve linear scaling would be to slow down the fast processing of small amounts of data, which of course would make no sense.

This is just a guess. We might be able to come to more meaningful conclusions if you share the hard data which led to your conclusion that Matlab scales linearly but Julia doesn’t.

Ah, I see what you mean. Yes, it wouldn’t make sense. I actually benchmarked a little more rigorously, and here are the results I get.

In Julia:

using LinearAlgebra
using Plots

function fun(A, B)
    Q = qr(A).Q
    return Q * B
end

function benchmark(n, sizes, num_repeats)
    num_sizes = length(sizes)
    times = zeros(num_sizes)
    for i = 1:num_sizes
        s = 0
        for repeat = 1:num_repeats
            A = randn(sizes[i], n)
            B = randn(n, n)
            s += (@timed fun(A, B)).time
        end
        times[i] = s / num_repeats
    end
    return times
end

n = 500
num_repeats = 5
num_sizes = 20
sizes = Int.(floor.(exp10.(range(3, stop=4, length=num_sizes))))

times = benchmark(n, sizes, num_repeats)
plot(sizes, times, xaxis=:log, yaxis=:log, label="")
plot!(sizes, sizes ./ 1e5, xaxis=:log, yaxis=:log, label="")

The plot I get:

In MATLAB:

n = 500;
num_repeats = 5;
num_sizes = 20;
sizes = floor(logspace(3, 4, num_sizes));
times = zeros(num_sizes, 1);

for i = 1:num_sizes
    disp(sizes(i))
    s = 0;
    for repeat = 1:num_repeats
        A = randn(sizes(i), n);
        B = randn(n, n);
        tic
        [Q, ~] = qr(A, 0);   % single-output qr(A, 0) would return R, not Q
        C = Q * B;
        s = s + toc;
    end
    times(i) = s / num_repeats;
end

hold on
set(gca, 'XScale', 'log', 'YScale', 'log')
loglog(sizes, times)
loglog(sizes, sizes / 1e5)

The plot I get:

As you can see, both scale more or less linearly, but MATLAB is much faster.
There may be a few things to change in my code to make the benchmark fairer. Let me know if that’s the case.


What do you get from

julia> LinearAlgebra.BLAS.vendor()
:openblas64

?
MATLAB uses MKL as its BLAS, which is usually much faster than OpenBLAS. You can use MKL in Julia by installing MKL.jl.
