mul!(::Matrix, ::Matrix, ::UpperTriangular) is really slow

There seem to be efficient methods to do mul!(out, ::UpperTriangular, ::Matrix) and mul!(out, ::LowerTriangular, ::Matrix). However, when you reverse the arguments and use mul!(out, ::Matrix, ::UpperTriangular) the performance is much much worse! It’s much worse than just using mul!(out, ::Matrix, ::Matrix).

Benchmarks

using LinearAlgebra
using BenchmarkTools

n = 1000

prealloc = Matrix{Float64}(undef, n, n)

@benchmark mul!($prealloc, m, v) setup=begin
    m = rand(n, n)
    v = rand(n, n)
end  # 12 ms

@benchmark mul!($prealloc, m, v) setup=begin
    m = UpperTriangular(rand(n, n))
    v = rand(n, n)
end  # 8 ms

@benchmark mul!($prealloc, m, v) setup=begin
    m = LowerTriangular(rand(n, n))
    v = rand(n, n)
end  # 8 ms

@benchmark mul!($prealloc, m, v) setup=begin
    m = rand(n, n)
    v = UpperTriangular(rand(n, n))
end  # 800 ms

@benchmark mul!($prealloc, m, v) setup=begin
    m = rand(n, n)
    v = LowerTriangular(rand(n, n))
end  # 800 ms

This is a bug, right?

The fast methods use BLAS calls while the slow ones (where v is the triangular matrix) fall back to the generic matmul algorithm. It shouldn’t be difficult to add methods to dispatch them to BLAS as well.

Maybe this helps in the meantime:

@btime mul!($prealloc, m', v')' setup=begin
    m = UpperTriangular(rand(n, n))
    v = rand(n, n)
end  

@btime mul!($prealloc, m', v')' setup=begin
    m = LowerTriangular(rand(n, n))
    v = rand(n, n)
end  

I thought about that as well, but then I’m no longer returning prealloc, but the adjoint of prealloc, which isn’t so clean. I’d prefer to pay the slight performance cost and just not annotate my matrix with UpperTriangular until this gets fixed.

Issue created: https://github.com/JuliaLang/julia/issues/42550

Until this is fixed, you can use the allocating * instead of mul!, this actually uses BLAS and is indeed fast.