mul! much slower than * for QRCompactWYQ

Suppose I have generated the following QRCompactWYQ:

``````Q = qr(randn(500, 10)).Q;
``````

and I want to do in-place multiplication of Q with

``````x = randn(500)
``````

to

``````y = randn(500)
``````

i.e.

``````@btime mul!(y, Q, x)
``````

This runs in 2.9s in my machine, which is significantly slower than doing

``````@btime y = Q*x
``````

which runs in 8.8μs.

The LAPACK routine that does this modifies the right-hand side directly. It is interfaced via `lmul!(Q, x)`. I’m guessing it’s an oversight that the corresponding `mul!` method is missing.

The `mul!` function falls back to a generic method, which becomes extremely slow since the elements of Q take time to compute. It should be easy to write the missing method as it just needs to copy `x` into `y` and then call `lmul!`.

Thanks, I will create an issue/PR in Julia’s github repo then.

I ran into this issue for a `QRPackedQ` from a `qr` with pivoting.

MWE:

``````x = rand(64,64);
y = rand(64,64);
z = rand(64,64);
F = qr(x, Val(true));

@time mul!(x, F.Q, y);
@time mul!(x, Matrix(F.Q), y);
@time mul!(x, y, y);

# even worse if we want to use an adjoint
@time mul!(x, F.Q', y);
@time mul!(x, Matrix(F.Q)', y);
@time mul!(x, y', y);
``````

which on my machine gives

``````0.318706 seconds (32.79 k allocations: 273.067 MiB, 7.32% gc time)
0.000707 seconds (9 allocations: 80.922 KiB)
0.000030 seconds (4 allocations: 160 bytes)

10.129926 seconds (1.08 M allocations: 8.798 GiB, 6.68% gc time)
0.000769 seconds (10 allocations: 80.938 KiB)
0.000039 seconds (5 allocations: 176 bytes)
``````

(`typeof(F.Q) == LinearAlgebra.QRPackedQ{Float64,Array{Float64,2}}`)