Linear least-squares solution occasionally much slower than what is achievable using LAPACK based wrappers

The computation of the least-squares solution of linear equations using x = A\b is occasionally substantially slower than the attainable speed using LAPACK based wrappers. For example, let A be a full matrix A = rand(1000,700) and b = rand(1000) and let At be an upper triangular matrix At = triu(A) and A10 a Hessenberg-like matrix A10 = triu(A,-10) with 10 subdiagonals. The run-times on my machine for A\b, At\b and A10\b are of the same order of magnitudes as can be seen below:

n = 1000; m = 700;
A = rand(n,m); A10 = triu(a,-10); At = triu(a); b = rand(n);
@time  A\b;
@time At\b;
@time A10\b;

The timing results are:

julia> @time  A\b;
  0.134719 seconds (5.65 k allocations: 13.607 MiB)
julia> @time At\b;
  0.123878 seconds (5.56 k allocations: 13.576 MiB)
julia> @time A10\b;
  0.121700 seconds (5.65 k allocations: 13.762 MiB)

As can be observed, there are practically no differences in the timing figures for the three matrices.

However, by using an experimental implementation (qrsolve!, see below) of the QR-decomposition based solution approach relying on wrappers to the LAPACK families _LARFG.F and _LARF.F implemeting the computation of Householder transformations and their applications, the results are much better for the matrices At and A10 as can be seen from the following figures:

@time  qrsolve!(A,b);
@time  qrsolve!(At,b);
@time  qrsolve!(A10,b);

for which the timing results are:

julia> @time  qrsolve!(A,b);
  0.118777 seconds (4.91 k allocations: 9.708 MiB)
julia> @time  qrsolve!(At,b);
  0.003154 seconds (4.91 k allocations: 9.708 MiB)
julia> @time  qrsolve!(A10,b);
  0.011451 seconds (4.91 k allocations: 9.708 MiB)

The reason for this pleasant behaviour lies in the implementation of _LARF.F, which automatically discovers trailing zeros in the successive columns of A and applies the transformations using only the rest of nonzero entries. This strategy is not implemented in the function reflectorApply! (which is used for example in qrfactUnblocked!), so even cases when no computations are needed (e.g., A upper triangular), the transformations (all identity matrices) are still fully applied.

The wrappers to the LAPACK families _LARFG.F and _LARF.F are available in the MatrixPencils.jl collection.

Here is the experimental function qrsolve!, implemeted for the particular case of tall matrices.

function qrsolve!(A::AbstractMatrix{T},b::AbstractVector{T}) where {T}
    m, n = size(A) 
    m < n && error("Column dimension exceeds row dimension") 
    for k = 1:n
        x = view(A, k:m, k)
        τk, β = larfg!(x)
        larf!('L', x, τk, view(A, k:m, k + 1:n))
        larf!('L', x, τk, view(b, k:m, 1:1))
        A[k,k] = β
    return UpperTriangular(triu(A[1:n,:]))\b[1:n]