# Economy-sized QR decomposition

Hello,

It seems like the function `qr` doesn’t take keywords argument to return an economy-sized factorization.
In this thread qr(A; full=false) no longer available? · Issue #27397 · JuliaLang/julia · GitHub they say that the internal representation is efficient so it shouldn’t be a problem. But when it comes to do matrix multiplications with the Q factor I get weird performances.

Here is an example:

``````function fun1()
A = randn(10000, 500)
B = randn(10000, 500)
Q, R = qr(A)
return Q * B
end

function fun2()
A = randn(10000, 500)
B = randn(500, 500)
Q, R = qr(A)
return Q * B
end
``````

The function `fun2` should use the thin QR when evaluating the product `Q * B`.
But when we time the functions

``````using BenchmarkTools
@btime fun1();
@btime fun2();
``````

we find that `fun2` is as slow as `fun1`, or even slower because there are more allocations.

So my question would be: how to take advantage of the thin QR decomposition to get better performances?

Side note: If I try to do the multiplication with `Q[:, 1:500] * B` it is even slower (actually it takes forever – not sure what’s happening here).

What is the `typeof(Q) `? Maybe you need to do `Matrix(Q) ` to get an ordinary matrix which is fast to multiply by. The Q returned by qr might be lazy.

Possibly related

4 Likes

`Q*Bthin` internally zero-extends the `n x n` matrix `Bthin` to an `m x n` matrix `Bfull` and then computes `Q*Bfull`, see

That’s why you do not get any speedup in your `fun2()`.

According to the documentation for `qr()`, you can get the thin `Q` factor as a dense matrix through `Matrix(Q)`, so `Matrix(Q)*Bthin` avoids the zero-padding, but it seems that the dense matrix algebra is less efficient than working with the compressed Householder representation:

``````julia> A = rand(10000,500)
Bfull = rand(10000,500)
Bthin = rand(500,500)
Q,R = qr(A)

@time Q*Bfull
@time Matrix(Q)*Bthin
;
0.184003 seconds (4 allocations: 40.054 MiB)
0.300148 seconds (8 allocations: 116.349 MiB, 5.89% gc time)
``````
3 Likes

Thank you for your replies.

The type of `Q` is `LinearAlgebra.QRCompactWYQ{Float64,Array{Float64,2}}`. And indeed, `Matrix(Q)` returns a `10000 x 500` matrix. I understand better the internal representations now.

I have the same results as you, ettersi. When I perform the operation `Matrix(Q) * Bthin` it is as slow, or even slower than `Q * Bfull`.
Do you think it is something that we can hope to improve? That is, is it an issue with the way Julia builds `Matrix(Q)`? Or there is some inherent limitation of the `QR` decomposition that makes the thin decomposition not much faster than the full one?

On second thought, it is not too hard to see why `Matrix(Q) * Bthin` is slower than `Q * Bthin`: the `QRCompactWYQ` data type represents `Q` as a sequence of Householder reflectors,

Q = \prod_{k = 1}^n (I - 2 \, u_k u_k^T) .

Therefore:

• `Q * Bthin = Q * [Bthin; zeros(m-n,n)]` requires `O(m*n)` operations.
• `Matrix(Q) * Bthin` requires `O(m*n^2)` operations.

This probably also explains why Julia (or LAPACK for that matter) does not provide an “optimised” code for `Q*Bthin`. The runtime of this operation would still be `O(m*n)`, i.e. it would have the same asymptotic complexity as `Q*Bpadded` and “only” a slightly better prefactor. Specifically, I believe `Q*Bpadded` requires roughly `5*m*n` additions and multiplications while an optimal implementation of `Q*Bthin` would require `(2m + 3n)*n` additions and multiplications.

4 Likes

I see, thank you that’s very helpful!

It seems like that computing the product `Q * Bthin` is already mostly optimized.

I am just wondering: is the zero padding allocating memory? Or it is something performed lazily and internally Julia knows that there is just a bunch of zeros?

No, the zero padding allocates memory.

Ok, I think this explains why it doesn’t scale linearly in `m` (considering `A` is `m x n`). It’s slightly more than linear.

When I am trying the same code on MATLAB it does scale linearly if I use the thin QR decomposition.

Is there any chance to rewrite things in a way that would be linear? Maybe with low-level LAPACK functions? I am looking into it but it’s not so clear.

I suspect that Julia failing to be linear in `m` is actually a sign of how optimised the code is. Specifically, I suspect that the nonlinearity is due to slower memory accesses if you operate on a larger amount of data. If so, the only way to achieve linear scaling would be to slow down the fast processing of small amounts of data which of course would make no sense.

This is just a guess. We might be able to come to more meaningful conclusions if you share the hard data which lead to your conclusion that Matlab scales linearly but Julia doesn’t.

Ah I see what you mean. Yes it wouldn’t make sense. Actually I benchmarked a little more rigorously and here are the results I get.

In Julia:

``````using LinearAlgebra
using Plots

function fun(A, B)
Q = qr(A).Q
C = Q * B
end

function benchmark(n, sizes, num_repeats)
num_sizes = length(sizes)
times = zeros(num_sizes)
for i = 1:num_sizes
s = 0
for repeat = 1:num_repeats
A = randn(sizes[i], n)
B = randn(n, n)
s += (@timed fun(A, B)).time
end
times[i] = s / num_repeats
end
return times
end

n = 500
num_repeats = 5
num_sizes = 20
sizes = Int.(floor.(exp10.(range(3, stop=4, length=num_sizes))))

times = benchmark(n, sizes, num_repeats)
plot(sizes, times, xaxis=:log, yaxis=:log, label="")
plot!(sizes, sizes ./ 1e5, xaxis=:log, yaxis=:log, label="")
``````

The plot I get:

In MATLAB:

``````n = 500;
num_repeats = 5;
num_sizes = 20;
sizes = floor(logspace(3, 4, num_sizes));
times = zeros(num_sizes, 1);

for i = 1:num_sizes
disp(sizes(i))
s = 0;
for repeat = 1:num_repeats
A = randn(sizes(i), n);
B = randn(n, n);
tic
Q = qr(A, 0);
C = Q * B;
s = s + toc;
end
times(i) = s / num_repeats;
end

hold on
set(gca, 'XScale', 'log', 'YScale', 'log')
loglog(sizes, times)
loglog(sizes, sizes / 1e5)
``````

The plot I get:

As you see both scale more or less linearly. But MATLAB is much faster.
There are maybe a few things to change in my code for the benchmark to be more fair. Let me know if it’s the case.

1 Like

What do you get from

``````julia> LinearAlgebra.BLAS.vendor()
:openblas64
``````

?
Matlab uses MKL as BLAS, which is usually much faster than openblas. You can use MKL by installing MKL.jl

2 Likes