Hi!
I tried comparing the performance of gemm! and symm! and got these results for small matrices. gemm! performs much better, so when should symm! be used?
C = rand(10, 10)
A = rand(10, 10)
B = rand(10, 10)
julia> @benchmark LinearAlgebra.BLAS.symm!('R', 'U', true, A, B, true, C)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 36.900 μs … 110.900 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 42.800 μs ┊ GC (median): 0.00%
Time (mean ± σ): 43.492 μs ± 3.949 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▂▅▆▇▃██▂▂
▁▁▁▁▁▂▃▃▅▇███████████▅▃▃▃▂▂▂▂▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
36.9 μs Histogram: frequency by time 59.3 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark LinearAlgebra.BLAS.gemm!('N', 'N', true, A, B, true, C)
BenchmarkTools.Trial: 10000 samples with 417 evaluations.
Range (min … max): 251.799 ns … 459.712 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 264.029 ns ┊ GC (median): 0.00%
Time (mean ± σ): 264.118 ns ± 4.118 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂▂▂▁
▂▁▂▂▂▁▂▁▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▃▃▅▆▆▇███████▇▆▅▇▄▃▃▂▂▂▂▂ ▃
252 ns Histogram: frequency by time 268 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia>
BLAS functions tend to be optimized for large matrices. For very small matrices, StaticArrays are faster, at least if the size of the matrix is known at compile-time:
julia> using LinearAlgebra, StaticArrays, BenchmarkTools
julia> C,A,B = rand(10,10), rand(10,10), rand(10,10);
julia> @btime LinearAlgebra.BLAS.gemm!('N', 'N', true, $A, $B, true, $C);
301.444 ns (0 allocations: 0 bytes)
julia> @btime LinearAlgebra.BLAS.symm!('R', 'U', true, $A, $B, true, $C);
110.352 μs (0 allocations: 0 bytes)
julia> As,Bs = SMatrix{10,10}(A), SMatrix{10,10}(B);
julia> @btime $As * $Bs;
85.477 ns (0 allocations: 0 bytes)
(You could probably implement an even faster method for Symmetric{<:SMatrix}
multiplication too, but I don’t think this exists right now.)
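For illustration only, here is a minimal sketch of what such a method could look like (the name symm_smat is made up, and whether it actually beats the plain SMatrix product would need benchmarking). It reads only the upper triangle of A, mirroring it on the fly:

```julia
using StaticArrays

# Hypothetical sketch: compute B * Symmetric(A, :U) for static matrices,
# reading only the upper triangle of A.
function symm_smat(B::SMatrix{M,N,T}, A::SMatrix{N,N,T}) where {M,N,T}
    SMatrix{M,N,T}(ntuple(Val(M * N)) do l
        # column-major linear index l -> (i, j)
        i, j = (l - 1) % M + 1, (l - 1) ÷ M + 1
        s = zero(T)
        for k in 1:N
            # use A[k, j] from the upper triangle, A[j, k] otherwise
            s += B[i, k] * (k <= j ? A[k, j] : A[j, k])
        end
        s
    end)
end
```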
photor
December 28, 2022, 8:54am
3
Such a large difference between symm! and gemm! seems unreasonable. Actually, I cannot reproduce your result: on my machine the difference is only ~5% with your setup.
photor
December 28, 2022, 8:58am
4
My env:
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-4771 CPU @ 3.50GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, haswell)
uniment
December 28, 2022, 9:05am
5
I get measurements similar to @stevengj's and @PharmCat's:
julia> using LinearAlgebra, StaticArrays, BenchmarkTools
julia> C,A,B = rand(10,10), rand(10,10), rand(10,10);
julia> @btime LinearAlgebra.BLAS.gemm!('N', 'N', true, $A, $B, true, $C);
189.628 ns (0 allocations: 0 bytes)
julia> @btime LinearAlgebra.BLAS.symm!('R', 'U', true, $A, $B, true, $C);
41.400 μs (0 allocations: 0 bytes)
julia> As,Bs = SMatrix{10,10}(A), SMatrix{10,10}(B);
julia> @btime $As * $Bs;
50.557 ns (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.9.0-alpha1
Commit 0540f9d739 (2022-11-15 14:37 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 20 × 12th Gen Intel(R) Core(TM) i9-12900HK
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
Threads: 20 on 20 virtual cores
photor
December 28, 2022, 9:08am
6
It seems to be a bug affecting a newer CPU or a newer BLAS version.
My versioninfo:
julia> versioninfo()
Julia Version 1.8.2
Commit 36034abf26 (2022-09-29 15:21 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
Threads: 16 on 32 virtual cores
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS = 16
I think you should file an issue at GitHub - xianyi/OpenBLAS: OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version. It looks like there is a performance issue with dsymm for your architecture.
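In the meantime, a possible stopgap (my suggestion, not from the thread): if the symmetric matrix is stored in full, with both triangles filled, then symm! and gemm! compute the same product, so gemm! can be substituted for small sizes while dsymm is slow:

```julia
using LinearAlgebra

A = [1.0 2.0; 2.0 3.0]          # symmetric, both triangles stored
B = rand(2, 2)
C1 = copy(B); C2 = copy(C1)
# symm!('R', 'U', ...) computes C = alpha*B*A + beta*C, reading only
# the upper triangle of A; with full symmetric storage gemm! agrees.
LinearAlgebra.BLAS.symm!('R', 'U', 1.0, A, B, 1.0, C1)
LinearAlgebra.BLAS.gemm!('N', 'N', 1.0, B, A, 1.0, C2)
@assert C1 ≈ C2
```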