Gemm! vs symm! performance

Hi!

I tried to compare the performance of gemm! and symm! and got these results for small matrices. gemm! is much faster, so when should symm! be used?

C = rand(10, 10)
A = rand(10, 10)
B = rand(10, 10)
julia> @benchmark LinearAlgebra.BLAS.symm!('R', 'U', true, A, B, true, C)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  36.900 μs … 110.900 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     42.800 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   43.492 μs ±   3.949 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

            ▁▂▅▆▇▃██▂▂
  ▁▁▁▁▁▂▃▃▅▇███████████▅▃▃▃▂▂▂▂▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  36.9 μs         Histogram: frequency by time         59.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark LinearAlgebra.BLAS.gemm!('N', 'N', true, A, B, true, C)
BenchmarkTools.Trial: 10000 samples with 417 evaluations.
 Range (min … max):  251.799 ns … 459.712 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     264.029 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   264.118 ns ±   4.118 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                             █▂▂▂▁
  ▂▁▂▂▂▁▂▁▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▃▃▅▆▆▇███████▇▆▅▇▄▃▃▂▂▂▂▂ ▃
  252 ns           Histogram: frequency by time          268 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

BLAS functions tend to be optimized for large matrices. For very small matrices, StaticArrays are faster, at least if the size of the matrix is known at compile-time:

julia> using LinearAlgebra, StaticArrays, BenchmarkTools

julia> C,A,B = rand(10,10), rand(10,10), rand(10,10);

julia> @btime LinearAlgebra.BLAS.gemm!('N', 'N', true, $A, $B, true, $C);
  301.444 ns (0 allocations: 0 bytes)

julia> @btime LinearAlgebra.BLAS.symm!('R', 'U', true, $A, $B, true, $C);
  110.352 μs (0 allocations: 0 bytes)

julia> As,Bs = SMatrix{10,10}(A), SMatrix{10,10}(B);

julia> @btime $As * $Bs;
  85.477 ns (0 allocations: 0 bytes)

(You could probably implement an even faster method for Symmetric{<:SMatrix} multiplication too, but I don’t think this exists right now.)
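
Something along these lines might work, though it's only a sketch and not benchmarked; symm_static_mul is a hypothetical name, not an existing StaticArrays method:

using LinearAlgebra, StaticArrays

# Sketch of an unrolled multiply for a Symmetric wrapper around an SMatrix.
# Indexing through the wrapper mirrors the stored triangle, and building the
# result as an SMatrix avoids a generic, allocating fallback.
function symm_static_mul(S::Symmetric{T,SMatrix{N,N,T,L}},
                         B::SMatrix{N,M,T}) where {N,M,T,L}
    data = ntuple(Val(N * M)) do k
        i = (k - 1) % N + 1      # row index (result is built column-major)
        j = (k - 1) ÷ N + 1      # column index
        acc = zero(T)
        for l in 1:N
            acc += S[i, l] * B[l, j]   # S[i, l] only ever reads the stored triangle
        end
        acc
    end
    return SMatrix{N,M,T}(data)
end

Ss = Symmetric(SMatrix{10,10}(rand(10, 10)))
Bs = SMatrix{10,10}(rand(10, 10))
symm_static_mul(Ss, Bs) ≈ Matrix(Ss) * Matrix(Bs)   # correctness check

Whether that actually beats the plain SMatrix product would need benchmarking; the symmetric structure halves the distinct reads of S but not the flop count.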

Such a large difference between symm! and gemm! seems unreasonable. Actually, I cannot reproduce your result: on my machine the difference is only ~5% for your setup.

My env:

Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4771 CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, haswell)
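
One thing that might be worth checking on the machines where symm! is slow (just a guess on my part): whether OpenBLAS threading overhead for such tiny matrices is the culprit. Forcing a single BLAS thread and re-running the benchmark would tell:

using LinearAlgebra

@show BLAS.get_num_threads()   # how many threads OpenBLAS is using (needs Julia >= 1.6)
BLAS.set_num_threads(1)        # force single-threaded BLAS, then re-run the symm! benchmark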

I get similar measurements to @stevengj and @PharmCat:

julia> using LinearAlgebra, StaticArrays, BenchmarkTools

julia> C,A,B = rand(10,10), rand(10,10), rand(10,10);

julia> @btime LinearAlgebra.BLAS.gemm!('N', 'N', true, $A, $B, true, $C);
  189.628 ns (0 allocations: 0 bytes)

julia> @btime LinearAlgebra.BLAS.symm!('R', 'U', true, $A, $B, true, $C);
  41.400 μs (0 allocations: 0 bytes)

julia> As,Bs = SMatrix{10,10}(A), SMatrix{10,10}(B);

julia> @btime $As * $Bs;
  50.557 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.9.0-alpha1
Commit 0540f9d739 (2022-11-15 14:37 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i9-12900HK
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
  Threads: 20 on 20 virtual cores

It seems to be a bug affecting newer CPUs or a newer BLAS version.

My versioninfo:

julia> versioninfo()
Julia Version 1.8.2
Commit 36034abf26 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 16 on 32 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 16

I think you should file this issue at the OpenBLAS repository (GitHub - xianyi/OpenBLAS). It looks like there is a performance issue with dsymm for your architecture.
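
When filing, it probably helps to include which BLAS build Julia loaded along with the CPU details; on Julia 1.7+ something like this collects that information (BLAS.get_config is only available from 1.7 on):

using LinearAlgebra, InteractiveUtils

BLAS.get_config()   # lists the loaded BLAS libraries (e.g. libopenblas64_)
versioninfo()       # Julia build, OS, and CPU details for the report

As a cross-check, loading MKL.jl (using MKL) swaps the BLAS backend for the session, so re-running the symm! benchmark under MKL would confirm whether the slowdown is specific to OpenBLAS.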
