Why is Matrix multiplication with an UpperTriangular slower than with normal Matrix?

I would expect it to be faster:

```julia
using LinearAlgebra, BenchmarkTools
A = randn(60, 60); B = randn(60, 60);
@btime $A * $B'                    # 19.693 μs (2 allocations: 28.20 KiB)
@btime $(UpperTriangular(A)) * $B' # 21.305 μs (4 allocations: 56.41 KiB)
```

When the transpose of `B` is omitted, it is the other way round:

```julia
@btime $A * $B                    # 18.578 μs (2 allocations: 28.20 KiB)
@btime $(UpperTriangular(A)) * $B # 16.805 μs (2 allocations: 28.20 KiB)
```

I would first check with the `@which` macro which method is actually being called in each case. It's possible that there is no specialized implementation for `UpperTriangular * Transpose` yet.
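For example, you can compare the dispatch for the two cases directly in the REPL (the exact methods printed will depend on your Julia version):

```julia
using LinearAlgebra

A = randn(60, 60); B = randn(60, 60)

# Which method handles the plain triangular product?
@which UpperTriangular(A) * B

# And which one handles the product with a lazy transpose?
# If this lands on a generic fallback rather than a triangular-aware
# method, that would explain the slowdown.
@which UpperTriangular(A) * B'
```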

`trmm` is implemented in BLAS/LAPACK. If for some reason we weren’t dispatching to it, it’d be many times slower instead of being in the same ballpark.
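You can also call the BLAS routine directly through Julia's wrapper to see what the fast path looks like (this uses the `BLAS.trmm` wrapper from the `LinearAlgebra` stdlib; `trmm` only reads the upper triangle of `A`, so passing the full matrix is fine):

```julia
using LinearAlgebra

A = randn(60, 60); B = randn(60, 60)

# 'L': triangular matrix multiplies from the left
# 'U': use the upper triangle of A
# 'N': don't transpose A
# 'N': non-unit diagonal
C = BLAS.trmm('L', 'U', 'N', 'N', 1.0, A, B)

C ≈ UpperTriangular(A) * B  # ≈ true
```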
I’ll try to get LoopVectorization to support this by the end of the year. It will be interesting to see how it performs then.

Asymptotically, I’d expect it to be 2x faster. At small sizes, you have the overhead of dealing with the triangular part.

I’d have expected `B * LowerTriangular(A')` to be easier to optimize, but transposing the output would also have some overhead.
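To illustrate that rewrite: the two expressions are mathematically equivalent, since transposing moves the triangular factor to the other side (a quick sanity check, not a benchmark):

```julia
using LinearAlgebra

A = randn(60, 60); B = randn(60, 60)

# UpperTriangular(A) * B' can be rewritten so the triangular matrix
# multiplies from the right, at the cost of a transpose of the output:
C1 = UpperTriangular(A) * B'
C2 = (B * LowerTriangular(A'))'

C1 ≈ C2  # ≈ true (same sums, possibly different floating-point order)
```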
I see a much larger difference (using MKL):

```julia
julia> using LinearAlgebra, BenchmarkTools

julia> A = randn(60,60); B = randn(60,60);

julia> @benchmark $A * $B'
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  3.907 μs … 79.820 μs  ┊ GC (min … max): 0.00% … 86.60%
 Time  (median):     4.229 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.642 μs ±  3.468 μs  ┊ GC (mean ± σ):  4.22% ±  5.60%

  ▃██▄▂                                                      ▂
  █████▇▅▄▁▃▁▁▁▁▃▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▁▃▄▅▆▆▆▇▇▇█▇ █
  3.91 μs      Histogram: log(frequency) by time     13.4 μs <

 Memory estimate: 28.20 KiB, allocs estimate: 2.

julia> @benchmark $(UpperTriangular(A)) * $B'
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):   8.580 μs … 258.670 μs  ┊ GC (min … max): 0.00% … 84.64%
 Time  (median):      9.999 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.031 μs ±  13.739 μs  ┊ GC (mean ± σ):  7.06% ±  5.45%

   ▅▇▄▇█
  ▃▅▆███████▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▂▁▂▁▁▁▁▂▁▂▁▁▁▁▁▂▁▂▁▂▂▂▂▂▂▂ ▃
  8.58 μs         Histogram: frequency by time         21.6 μs <

 Memory estimate: 56.41 KiB, allocs estimate: 4.

julia> @benchmark $A * $B
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.824 μs … 75.079 μs  ┊ GC (min … max): 0.00% … 89.37%
 Time  (median):     4.218 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.443 μs ±  3.163 μs  ┊ GC (mean ± σ):  4.68% ±  6.01%

       ▁▂▄▅▆▇▆██▇▆▅▃▂
  ▂▁▂▂▂▂▂▂▂▂▃▃▃▃▄▄▄▅▅▆▇█████████████████▆▆▅▄▄▃▃▃▃▃▂▃▂▂▂▂▂▂▂▂ ▄
  3.82 μs        Histogram: frequency by time         4.6 μs <

 Memory estimate: 28.20 KiB, allocs estimate: 2.

julia> @benchmark UpperTriangular($A) * $B
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.209 μs … 136.078 μs  ┊ GC (min … max): 0.00% … 85.47%
 Time  (median):     6.243 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.709 μs ±   7.580 μs  ┊ GC (mean ± σ):  6.40% ±  5.39%

      ▃▃▅▅▆▇▇██▇▅▂
  ▂▂▃▄▅▆▆▅▅▄▄▅▅▇█████████████▇▆▄▅▅▅▅▆▅▄▄▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂ ▄
  5.21 μs         Histogram: frequency by time         880 μs <

 Memory estimate: 28.20 KiB, allocs estimate: 2.
```