Why is Matrix multiplication with an UpperTriangular slower than with normal Matrix?

I would expect it to be faster:

```julia
using LinearAlgebra, BenchmarkTools
A = randn(60, 60); B = randn(60, 60);
@btime $A * $B'                    # 19.693 μs (2 allocations: 28.20 KiB)
@btime $(UpperTriangular(A)) * $B' # 21.305 μs (4 allocations: 56.41 KiB)
```

When the transpose of `B` is omitted, it is the other way round:

```julia
@btime $A * $B                    # 18.578 μs (2 allocations: 28.20 KiB)
@btime $(UpperTriangular(A)) * $B # 16.805 μs (2 allocations: 28.20 KiB)
```

I would first check with the `@which` macro which method is actually being called in each case. It's possible that there is no specialized implementation for `UpperTriangular * Transpose` yet.
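For example, you can compare the dispatch for the two cases directly in the REPL (the exact methods printed will depend on your Julia version):

```julia
using LinearAlgebra

A = randn(60, 60); B = randn(60, 60)

# Which method handles the plain triangular product?
@which UpperTriangular(A) * B

# And which one handles the product with a lazy transpose?
# If this lands on a generic fallback rather than a triangular-aware
# method, that would explain the slowdown.
@which UpperTriangular(A) * B'
```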

`trmm` is implemented in BLAS/LAPACK. If for some reason we weren’t dispatching to it, it’d be many times slower instead of being in the same ballpark.
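You can also call the BLAS routine directly through Julia's wrapper to see what the fast path looks like (this uses the `BLAS.trmm` wrapper from the `LinearAlgebra` stdlib; `trmm` only reads the upper triangle of `A`, so passing the full matrix is fine):

```julia
using LinearAlgebra

A = randn(60, 60); B = randn(60, 60)

# 'L': triangular matrix multiplies from the left
# 'U': use the upper triangle of A
# 'N': don't transpose A
# 'N': non-unit diagonal
C = BLAS.trmm('L', 'U', 'N', 'N', 1.0, A, B)

C ≈ UpperTriangular(A) * B  # ≈ true
```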
I’ll try to get LoopVectorization to support this by the end of the year. It will be interesting to see how it performs then.

Asymptotically, I’d expect it to be 2x faster. At small sizes, you have the overhead of dealing with the triangular part.

I’d have expected `B * LowerTriangular(A')` to be easier to optimize, but transposing the output would also have some overhead.
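To illustrate that rewrite: the two expressions are mathematically equivalent, since transposing moves the triangular factor to the other side (a quick sanity check, not a benchmark):

```julia
using LinearAlgebra

A = randn(60, 60); B = randn(60, 60)

# UpperTriangular(A) * B' can be rewritten so the triangular matrix
# multiplies from the right, at the cost of a transpose of the output:
C1 = UpperTriangular(A) * B'
C2 = (B * LowerTriangular(A'))'

C1 ≈ C2  # ≈ true (same sums, possibly different floating-point order)
```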
I see a much larger difference (using MKL):

```julia
julia> using LinearAlgebra, BenchmarkTools

julia> A = randn(60,60); B = randn(60,60);

julia> @benchmark $A * $B'
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  3.907 μs … 79.820 μs  ┊ GC (min … max): 0.00% … 86.60%
 Time  (median):     4.229 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.642 μs ±  3.468 μs  ┊ GC (mean ± σ):  4.22% ±  5.60%

  ▃██▄▂                                                      ▂
  █████▇▅▄▁▃▁▁▁▁▃▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▁▃▄▅▆▆▆▇▇▇█▇ █
  3.91 μs      Histogram: log(frequency) by time     13.4 μs <

 Memory estimate: 28.20 KiB, allocs estimate: 2.

julia> @benchmark $(UpperTriangular(A)) * $B'
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):   8.580 μs … 258.670 μs  ┊ GC (min … max): 0.00% … 84.64%
 Time  (median):      9.999 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.031 μs ±  13.739 μs  ┊ GC (mean ± σ):  7.06% ±  5.45%

   ▅▇▄▇█
  ▃▅▆███████▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▂▁▂▁▁▁▁▂▁▂▁▁▁▁▁▂▁▂▁▂▂▂▂▂▂▂ ▃
  8.58 μs         Histogram: frequency by time         21.6 μs <

 Memory estimate: 56.41 KiB, allocs estimate: 4.

julia> @benchmark $A * $B
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.824 μs … 75.079 μs  ┊ GC (min … max): 0.00% … 89.37%
 Time  (median):     4.218 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.443 μs ±  3.163 μs  ┊ GC (mean ± σ):  4.68% ±  6.01%

       ▁▂▄▅▆▇▆██▇▆▅▃▂
  ▂▁▂▂▂▂▂▂▂▂▃▃▃▃▄▄▄▅▅▆▇█████████████████▆▆▅▄▄▃▃▃▃▃▂▃▂▂▂▂▂▂▂▂ ▄
  3.82 μs        Histogram: frequency by time         4.6 μs <

 Memory estimate: 28.20 KiB, allocs estimate: 2.

julia> @benchmark UpperTriangular($A) * $B
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.209 μs … 136.078 μs  ┊ GC (min … max): 0.00% … 85.47%
 Time  (median):     6.243 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.709 μs ±   7.580 μs  ┊ GC (mean ± σ):  6.40% ±  5.39%

      ▃▃▅▅▆▇▇██▇▅▂
  ▂▂▃▄▅▆▆▅▅▄▄▅▅▇█████████████▇▆▄▅▅▅▅▆▅▄▄▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂ ▄
  5.21 μs         Histogram: frequency by time         880 μs <

 Memory estimate: 28.20 KiB, allocs estimate: 2.
```