Performance Difference in Matrix Multiplication with Symmetric Matrices

Hi everyone,

I have a question regarding the performance of matrix multiplication involving symmetric matrices in Julia. Here’s my code and the results:

using LinearAlgebra, BenchmarkTools

let N = 1000
    λ, Γp = [randn(N, N) for _ in 1:2]
    Γp_symm = Symmetric(Γp)
    Γp = Matrix(Γp_symm)
    @assert λ' * Γp * λ ≈ λ' * Γp_symm * λ
    @btime $λ' * $Γp_symm * $λ
    @btime $λ' * $Γp * $λ
    @btime $λ' * ($Γp_symm * $λ)
    @btime $λ' * ($Γp * $λ)
end

The timing results are as follows:

  1. λ' * Γp_symm * λ: 651.877 ms (6 allocations: 15.26 MiB)

  2. λ' * Γp * λ: 13.033 ms (6 allocations: 15.26 MiB)

  3. λ' * (Γp_symm * λ): 12.925 ms (6 allocations: 15.26 MiB)

  4. λ' * (Γp * λ): 12.975 ms (6 allocations: 15.26 MiB)

My questions are:

  1. Why is the multiplication involving the Symmetric type significantly slower than the other cases when no parentheses are used?

  2. Does matrix-matrix multiplication in Julia require parentheses to specify the evaluation order explicitly? Adding parentheses only slightly speeds up the plain multiplication, but it gives a huge speedup when a Symmetric matrix is involved.

Here is my Julia version information:

Julia Version 1.11.0
Commit 501a4f25c2b (2024-10-07 11:40 UTC)
Build Info:
  Official release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 256 × Intel(R) Xeon(R) Gold 6448H
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, sapphirerapids)
Threads: 4 default, 0 interactive, 2 GC (on 256 virtual cores)
Environment:
  JULIA_PKG_SERVER = https://mirrors.nju.edu.cn/julia
  JULIA_NUM_THREADS = 4

Any insights would be greatly appreciated!

I also tested it with TensorOperations.jl and still found the version with the Symmetric wrapper to be slower:

using LinearAlgebra, MKL, TensorOperations
let N = 1000
    λ, Γp = [randn(N, N) for _ in 1:2]
    Γp_symm = Symmetric(Γp)
    Γp = Matrix(Γp_symm)
    @tensor tmp1[i, j] := λ[k, i] * Γp[k, l] * λ[l, j]
    @tensor tmp2[i, j] := λ[k, i] * Γp_symm[k, l] * λ[l, j]
    @assert tmp1 ≈ tmp2
    @btime @tensor tmp[i, j] := $λ[k, i] * $Γp[k, l] * $λ[l, j]
    @btime @tensor tmp[i, j] := $λ[k, i] * $Γp_symm[k, l] * $λ[l, j]
end

The results:

 13.132 ms (6 allocations: 15.26 MiB)
 17.767 ms (19 allocations: 38.15 MiB)
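
My guess is that the extra allocations come from TensorOperations.jl materializing the Symmetric wrapper into a dense array internally before contracting (Symmetric is not a strided array). If that is right, doing the conversion explicitly once should recover the dense timing; a rough sketch of what I mean (same names as above):

using LinearAlgebra, TensorOperations

N = 1000
Γp_symm = Symmetric(randn(N, N))
λ = randn(N, N)

# one explicit dense copy, so @tensor only sees plain strided matrices
Γp_dense = Matrix(Γp_symm)
@tensor tmp[i, j] := λ[k, i] * Γp_dense[k, l] * λ[l, j]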

Additional information:

  • CPU: Intel(R) Xeon(R) Gold 6448H
  • BLAS and threading configuration:
      BLAS.get_num_threads() = 4
      Threads.nthreads() = 4
      BLAS.get_config() = LBTConfig([ILP64] libmkl_rt.so, [LP64] libmkl_rt.so)

Hi!
Thank you for your interest! I’m still a beginner and not sure where to start with analyzing this issue. Any guidance would be greatly appreciated!

You can use the @which or @edit macros to see which method is effectively being called. For instance, with x'*Asymm*x a three-argument * is dispatched, whereas with the parentheses a two-argument * ends up being dispatched. On my system, I don’t see performance differences with or without the parentheses (Julia 1.11.3). Note that for this case, you get 0 allocations and much better performance by calling dot(x, A, x).
EDIT: while it uses 0 allocations, dot is slower for me (probably due to a lack of parallelization).
EDIT2: sorry, I misread the first line; this is about a product of three matrices, so dot is unrelated here.
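
To make the dispatch difference concrete, here is a small sketch you can paste into the REPL (it reuses the names from the first post, with a smaller N purely for illustration):

using LinearAlgebra

N = 100
Γp_symm = Symmetric(randn(N, N))
λ = randn(N, N)

# chained multiplication parses as a single call to * with three arguments
dump(:(λ' * Γp_symm * λ))

# inspect which method each form actually dispatches to
@which λ' * Γp_symm * λ        # three-argument * defined in LinearAlgebra
@which λ' * (Γp_symm * λ)      # ordinary two-argument *, applied pairwise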