 # mul!(::Matrix, ::Matrix, ::UpperTriangular) is really slow

There seem to be efficient methods to do `mul!(out, ::UpperTriangular, ::Matrix)` and `mul!(out, ::LowerTriangular, ::Matrix)`. However, when you reverse the arguments and use `mul!(out, ::Matrix, ::UpperTriangular)` the performance is much much worse! It’s much worse than just using `mul!(out, ::Matrix, ::Matrix)`.

Benchmarks

``````using LinearAlgebra
using BenchmarkTools

n = 1000

prealloc = Matrix{Float64}(undef, n, n)

@benchmark mul!(\$prealloc, m, v) setup=begin
m = rand(n, n)
v = rand(n, n)
end  # 12 ms

@benchmark mul!(\$prealloc, m, v) setup=begin
m = UpperTriangular(rand(n, n))
v = rand(n, n)
end  # 8 ms

@benchmark mul!(\$prealloc, m, v) setup=begin
m = LowerTriangular(rand(n, n))
v = rand(n, n)
end  # 8 ms

@benchmark mul!(\$prealloc, m, v) setup=begin
m = rand(n, n)
v = UpperTriangular(rand(n, n))
end  # 800 ms

@benchmark mul!(\$prealloc, m, v) setup=begin
m = rand(n, n)
v = LowerTriangular(rand(n, n))
end  # 800 ms
``````

This is a bug, right?

The fast methods use BLAS calls while the slow ones (where `v` is the triangular matrix) fall back to the generic matmul algorithm. It shouldn’t be difficult to add methods to dispatch them to BLAS as well.

Maybe this helps in the meantime:

``````@btime mul!(\$prealloc, m', v')' setup=begin
m = UpperTriangular(rand(n, n))
v = rand(n, n)
end

@btime mul!(\$prealloc, m', v')' setup=begin
m = LowerTriangular(rand(n, n))
v = rand(n, n)
end
``````

I thought about that as well, but then I’m no longer returning `prealloc`, but the adjoint of `prealloc`, which isn’t so clean. I’d prefer to pay the slight performance cost and just not annotate my matrix with `UpperTriangular` until this gets fixed.

Until this is fixed, you can use the allocating `*` instead of `mul!`, this actually uses BLAS and is indeed fast.