mul! much slower than * for QRCompactWYQ

Suppose I have generated the following QRCompactWYQ:

Q = qr(randn(500, 10)).Q;

and I want to do in-place multiplication of Q with

x = randn(500)

to

y = randn(500)

i.e.

@btime mul!(y, Q, x)

This runs in 2.9s in my machine, which is significantly slower than doing

@btime y = Q*x

which runs in 8.8μs.

The LAPACK routine that does this modifies the right-hand side directly. It is interfaced via lmul!(Q, x). I’m guessing it’s an oversight that the corresponding mul! method is missing.

The mul! function falls back to a generic method, which becomes extremely slow since the elements of Q take time to compute. It should be easy to write the missing method as it just needs to copy x into y and then call lmul!.

Thanks, I will create an issue/PR in Julia’s github repo then.

@nrontsis Have you created an issue/PR about this? If so, could you share a link here?

I ran into this issue for a QRPackedQ from a qr with pivoting.

MWE:

x = rand(64,64);
y = rand(64,64);
z = rand(64,64);
F = qr(x, Val(true));

@time mul!(x, F.Q, y);
@time mul!(x, Matrix(F.Q), y);
@time mul!(x, y, y);

# even worse if we want to use an adjoint
@time mul!(x, F.Q', y);
@time mul!(x, Matrix(F.Q)', y);
@time mul!(x, y', y);

which on my machine gives

0.318706 seconds (32.79 k allocations: 273.067 MiB, 7.32% gc time)
0.000707 seconds (9 allocations: 80.922 KiB)
0.000030 seconds (4 allocations: 160 bytes)

# adjoint
10.129926 seconds (1.08 M allocations: 8.798 GiB, 6.68% gc time)
0.000769 seconds (10 allocations: 80.938 KiB)
0.000039 seconds (5 allocations: 176 bytes)

(typeof(F.Q) == LinearAlgebra.QRPackedQ{Float64,Array{Float64,2}})

EDIT: Critical slowdown of `mul!` when called with a `QRPackedQ` · Issue #31146 · JuliaLang/julia · GitHub

Thanks for reminding me about this. I have just opened an issue.