It seems that the function qr does not take a keyword argument to return an economy-sized factorization.
In the thread "qr(A; full=false) no longer available?" (Issue #27397 · JuliaLang/julia · GitHub) they say that the internal representation is efficient, so it shouldn't be a problem. But when it comes to doing matrix multiplications with the Q factor, I get weird performance.
Here is an example:
using LinearAlgebra

function fun1()
    A = randn(10000, 500)
    B = randn(10000, 500)   # B has as many rows as A
    Q, R = qr(A)
    return Q * B            # product with the "full" 10000 x 10000 Q
end

function fun2()
    A = randn(10000, 500)
    B = randn(500, 500)     # B has as many rows as A has columns
    Q, R = qr(A)
    return Q * B            # should only need the "thin" 10000 x 500 Q
end
The function fun2 should only need the thin Q factor when evaluating the product Q * B.
But when we time the functions
using BenchmarkTools
@btime fun1();
@btime fun2();
we find that fun2 is as slow as fun1, or even slower, because there are more allocations.
So my question would be: how can I take advantage of the thin QR decomposition to get better performance?
Side note: if I instead compute the multiplication as Q[:, 1:500] * B, it is even slower (actually it takes forever; I am not sure what is happening here).
Q*Bthin internally zero-extends the n x n matrix Bthin to an m x n matrix Bfull and then computes Q*Bfull; this padding happens inside the LinearAlgebra multiplication methods for the Q type.
That’s why you do not get any speedup in your fun2().
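You can check this equivalence directly; here is a minimal sketch using the names from above:

using LinearAlgebra

m, n = 10000, 500
A = randn(m, n)
Bthin = randn(n, n)
Q = qr(A).Q

# Q * Bthin gives the same result as padding Bthin with zeros up to m rows
Bfull = [Bthin; zeros(m - n, n)]
Q * Bthin ≈ Q * Bfull   # true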
According to the documentation for qr(), you can get the thin Q factor as a dense matrix through Matrix(Q), so Matrix(Q)*Bthin avoids the zero-padding. But it seems that dense matrix algebra is less efficient than working with the compressed Householder representation.
The type of Q is LinearAlgebra.QRCompactWYQ{Float64,Array{Float64,2}}. And indeed, Matrix(Q) returns a 10000 x 500 matrix. I understand the internal representations better now.
I have the same results as you, ettersi. When I perform the operation Matrix(Q) * Bthin, it is as slow as, or even slower than, Q * Bfull.
Do you think this is something we can hope to improve? That is, is it an issue with the way Julia builds Matrix(Q)? Or is there some inherent limitation of the QR decomposition that makes the thin decomposition not much faster than the full one?
On second thought, it is not too hard to see why Matrix(Q) * Bthin is slower than Q * Bthin: the QRCompactWYQ data type represents Q as a sequence of Householder reflectors, Q = H_1*H_2*...*H_n with H_k = I - tau_k*v_k*v_k', and Matrix(Q) is itself computed by applying these reflectors to the first n columns of the identity. Forming Matrix(Q) therefore costs about as much as computing Q * Bthin directly, and on top of that you still have to pay for the dense product Matrix(Q) * Bthin.
This probably also explains why Julia (or LAPACK for that matter) does not provide an “optimised” code for Q*Bthin. The runtime of this operation would still be O(m*n), i.e. it would have the same asymptotic complexity as Q*Bpadded and “only” a slightly better prefactor. Specifically, I believe Q*Bpadded requires roughly 5*m*n additions and multiplications while an optimal implementation of Q*Bthin would require (2m + 3n)*n additions and multiplications.
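To put numbers on this, take those counts at face value with the sizes from the example above, m = 10000 and n = 500: 5*m*n = 2.5e7 operations versus (2m + 3n)*n = 1.075e7 operations. So even an optimal Q*Bthin would only be about 2.3 times faster than Q*Bpadded, not asymptotically faster.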
It seems that computing the product Q * Bthin is already mostly optimized, then.
I am just wondering: does the zero padding allocate memory? Or is it performed lazily, with Julia knowing internally that it is just a bunch of zeros?
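I suppose one could check with @allocated, something like this quick sketch (if the padding is eager, Q * Bthin should allocate on the order of a full m x n buffer):

using LinearAlgebra

m, n = 10000, 500
Q = qr(randn(m, n)).Q
Bthin = randn(n, n)

Q * Bthin            # warm-up, so compilation is not measured
@allocated Q * Bthin # compare against m * n * 8 bytes, the size of one dense m x n buffer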
OK, I think this explains why it doesn't scale linearly in m (where A is m x n); it is slightly more than linear.
When I try the same code in MATLAB, it does scale linearly if I use the thin QR decomposition.
Is there any way to rewrite things so that the scaling is linear, maybe with low-level LAPACK functions? I am looking into it, but it is not so clear to me.
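For concreteness, here is the kind of low-level call sequence I have in mind (just a sketch using the LinearAlgebra.LAPACK wrappers; as far as I can tell, ormqr! applies the full m x m Q from the left, so its operand must still have m rows and the padding does not go away):

using LinearAlgebra

m, n = 10000, 500
A = randn(m, n)
B = randn(n, n)

# Householder QR in place: F holds the reflectors below the diagonal (and R above),
# tau holds the reflector coefficients
F, tau = LAPACK.geqrf!(copy(A))

# Apply Q from the left ('L') without transposing ('N'); the operand must have m rows
C = LAPACK.ormqr!('L', 'N', F, tau, [B; zeros(m - n, n)])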
I suspect that Julia failing to be linear in m is actually a sign of how optimised the code is. Specifically, I suspect that the nonlinearity is due to slower memory accesses when you operate on a larger amount of data. If so, the only way to achieve linear scaling would be to slow down the fast processing of small amounts of data, which of course would make no sense.
This is just a guess. We might be able to come to more meaningful conclusions if you share the hard data which led to your conclusion that MATLAB scales linearly but Julia doesn't.
Ah, I see what you mean; yes, that wouldn't make sense. I benchmarked a little more rigorously, and here are the results I get.
In Julia:
using LinearAlgebra
using Plots

function fun(A, B)
    Q = qr(A).Q
    return Q * B
end

function benchmark(n, sizes, num_repeats)
    num_sizes = length(sizes)
    times = zeros(num_sizes)
    for i = 1:num_sizes
        s = 0.0
        for repeat = 1:num_repeats
            A = randn(sizes[i], n)
            B = randn(n, n)
            s += (@timed fun(A, B)).time
        end
        times[i] = s / num_repeats   # average time over the repeats
    end
    return times
end

n = 500
num_repeats = 5
num_sizes = 20
sizes = Int.(floor.(exp10.(range(3, stop=4, length=num_sizes))))
times = benchmark(n, sizes, num_repeats)
plot(sizes, times, xaxis=:log, yaxis=:log, label="")
plot!(sizes, sizes ./ 1e5, xaxis=:log, yaxis=:log, label="")  # linear reference line
In MATLAB:
n = 500;
num_repeats = 5;
num_sizes = 20;
sizes = floor(logspace(3, 4, num_sizes));
times = zeros(num_sizes, 1);
for i = 1:num_sizes
    disp(sizes(i))
    s = 0;
    for repeat = 1:num_repeats
        A = randn(sizes(i), n);
        B = randn(n, n);
        tic
        [Q, R] = qr(A, 0);  % economy-size QR; with a single output, qr returns only R
        C = Q * B;
        s = s + toc;
    end
    times(i) = s / num_repeats;  % average time over the repeats
end
hold on
set(gca, 'XScale', 'log', 'YScale', 'log')
loglog(sizes, times)
loglog(sizes, sizes / 1e5)  % linear reference line
As you can see, both scale more or less linearly, but MATLAB is much faster.
There may be a few things to change in my code to make the benchmark fairer; let me know if that is the case.