Product of two symmetric matrices: LoopVectorization.jl vs LinearAlgebra

fipelle · August 31, 2021, 1:53am

Hi,

I am playing around with LoopVectorization.jl. In doing so, I have noticed that it seems to be about 1.5x faster than LinearAlgebra in computing the product between two symmetric matrices (while being about as fast in computing the product between symmetric and non-symmetric matrices). Is this expected behaviour?

I have written down a small example to highlight this point.

# Packages
using BenchmarkTools, LinearAlgebra, LoopVectorization, Random;

@inline function A_mul_B(A::AbstractMatrix{T}, B::AbstractMatrix{T}) where T<:Real

    # Memory pre-allocation
    C = zeros(size(A, 1), size(B, 2));

    # Compute matrix product
    @tturbo for j in axes(B, 2)
        for i in axes(A, 1)
            C_ij = zero(T)
            for k in axes(A, 2)
                C_ij += A[k,i]*B[k,j];
            end
            C[i,j] = C_ij;
        end
    end
    return C;
end

A_mul_B(A::Symmetric{T, Matrix{T}}, B::AbstractMatrix{T}) where T<:Real = A_mul_B(Array(A), B);
A_mul_B(A::Symmetric{T, Matrix{T}}, B::Symmetric{T, Matrix{T}}) where T<:Real = A_mul_B(Array(A), Array(B));

The benchmarks below show that the advantage is visible only with A*C.

Random.seed!(1);
A = Symmetric(randn(100,100));
B = randn(100,100);
C = Symmetric(randn(100,100));

# Symmetric times non-symmetric
@btime Array($A)*$B;
@btime $A*$B;
@btime A_mul_B($A,$B);

# Symmetric times symmetric
@btime $A*$C;
@btime A_mul_B($A,$C);

In my testing environment both Threads.nthreads() and BLAS.get_num_threads() are equal to 4. Array(...) slow things down a bit, but I could not find a better option to get the symmetric matrix in array format.

lungben · August 31, 2021, 9:37am

Just a minor off-topic remark: this defines a Float64 varialbe. If T is not Float64, this may cause a performance loss.
Better use c_ij = zero(T)

lmiq · August 31, 2021, 9:50am

Take a look at the benchmarks here: GitHub - JuliaLinearAlgebra/Octavian.jl: Multi-threaded BLAS-like library that provides pure Julia matrix multiplication

fipelle · August 31, 2021, 12:08pm

Thanks! Sorry - I was testing it with Float64 matrices and I forgot to put it back before posting it. I have edited the original post now.

fipelle · August 31, 2021, 12:55pm

Thank you. I have to take a proper look at it, but at first glance it does not seem to explain why my simple @tturbo implementation performs better than the default version in LinearAlgebra.

My guess is that the LinearAlgebra implementation either does not use a specialised routine for the product of two symmetric matrices (i.e., C_ij += A[k,i]*B[k,j] is replaced with the standard C_ij += A[i,k]*B[k,j]) or there are some small bottlenecks.

fipelle · August 31, 2021, 1:51pm

I do not think it only depends on the column-major order. Indeed, I have re-written A_mul_B replacing C_ij += A[k,i]*B[k,j] with C_ij += A[i,k]*B[k,j] and called it A_mul_B_standard. The benchmarks below show that A_mul_B is only marginally faster than A_mul_B_standard. However, they are both much faster than the LinearAlgebra implementation.

julia> @btime $A*$C;
77.714 μs (4 allocations: 156.41 KiB)

julia> @btime A_mul_B($A,$C);
43.537 μs (7 allocations: 234.62 KiB)

julia> @btime A_mul_B_standard($A,$C);
45.884 μs (7 allocations: 234.62 KiB)

Elrod · August 31, 2021, 3:26pm

Interesting, I’d have expected standard to be faster. Running a few benchmarks, and I get that the standard version is faster single threaded, but maybe a little slower multithreaded.
The standard version is easier to/more SIMD, so it does well single threaded.
The standard version is a little harder to chop into pieces and divide into threads, so when threading overhead and memory transfer dominates, being a little slower makes sense.

In multithreading, LoopVectorization will cut C into pieces, and then have different threads evaluate the pieces. With the non-standard version, these pieces of C correspond to nice contiguous chunks of A and B, hence the multithreading friendliness.

Multithreaded benchmark results:

julia> A = Symmetric(randn(100,100));

julia> B = randn(100,100);

julia> C = Symmetric(randn(100,100));

julia> @btime Array($A)*$B;
  17.295 μs (4 allocations: 156.34 KiB)

julia> @btime $A*$B;
  6.208 μs (2 allocations: 78.17 KiB)

julia> @btime A_mul_B($A,$B);
  21.938 μs (4 allocations: 156.34 KiB)

julia> @btime A_mul_B_standard($A,$B);
  21.482 μs (4 allocations: 156.34 KiB)

julia> @btime $A*$C;
  29.089 μs (4 allocations: 156.34 KiB)

julia> @btime A_mul_B($A,$C);
  33.940 μs (6 allocations: 234.52 KiB)

julia> @btime A_mul_B_standard($A,$C);
  32.722 μs (6 allocations: 234.52 KiB)

Single threaded:

julia> @btime Array($A)*$B;
  28.508 μs (4 allocations: 156.34 KiB)

julia> @btime $A*$B;
  23.684 μs (2 allocations: 78.17 KiB)

julia> @btime A_mul_B($A,$B);
  43.515 μs (4 allocations: 156.34 KiB)

julia> @btime A_mul_B_standard($A,$B);
  32.468 μs (4 allocations: 156.34 KiB)

julia> @btime $A*$C;
  39.691 μs (4 allocations: 156.34 KiB)

julia> @btime A_mul_B($A,$C);
  52.542 μs (6 allocations: 234.52 KiB)

julia> @btime A_mul_B_standard($A,$C);
  41.485 μs (6 allocations: 234.52 KiB)

Also worth pointing out that the copies took a significant chunk of the time, and that I’m using MKL:

julia> @benchmark Array($A)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):   7.964 μs … 364.126 μs  ┊ GC (min … max):  0.00% … 91.63%
 Time  (median):      8.588 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   10.444 μs ±  24.121 μs  ┊ GC (mean ± σ):  16.35% ±  6.87%

        ▂▄▆█████▇▇▆▅▅▃▃▂▁▁                                     ▃
  ▄▆████████████████████████▇█████████▇████▇▇▆▆▆▆▆▅▄▄▄▅▄▄▂▃▆▅▆ █
  7.96 μs       Histogram: log(frequency) by time      10.9 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.

julia> BLAS.get_config().loaded_libs
1-element Vector{LinearAlgebra.BLAS.LBTLibraryInfo}:
 LBTLibraryInfo(libmkl_rt.so, ilp64)

If you wanted to optimize the symmetric multiplication, I would divide the matrices into blocks, and then perform the multiplication blockwise. For non-diagonal blocks, these will always be dense.

Diagonal blocks will be trickier.

fipelle · August 31, 2021, 4:09pm

Thank you! I am not a SIMD expert, so I am not sure I am following everything you are saying. For clarity, I will refer to A_mul_B_standard and A_mul_B as the standard and non-standard versions.

Let me post the same benchmarks, as they show on my side. Note that I am not using MKL and I am working on a MacBook Air (Intel).

Single threaded

julia> @btime Array($A)*$B;
  89.299 μs (4 allocations: 156.41 KiB)

julia> @btime $A*$B;
  71.509 μs (2 allocations: 78.20 KiB)

julia> @btime A_mul_B($A,$B);
  69.120 μs (4 allocations: 156.41 KiB)

julia> @btime A_mul_B_standard($A,$B);
  65.145 μs (4 allocations: 156.41 KiB)

julia> @btime $A*$C;
  84.204 μs (4 allocations: 156.41 KiB)

julia> @btime A_mul_B($A,$C);
  77.258 μs (6 allocations: 234.61 KiB)

julia> @btime A_mul_B_standard($A,$C);
  72.635 μs (6 allocations: 234.61 KiB)

4 threads (Julia and BLAS)

julia> @btime Array($A)*$B;
  40.050 μs (4 allocations: 156.41 KiB)

julia> @btime $A*$B;
  41.270 μs (2 allocations: 78.20 KiB)

julia> @btime A_mul_B($A,$B);
  33.185 μs (5 allocations: 156.42 KiB)

julia> @btime A_mul_B_standard($A,$B);
  34.400 μs (5 allocations: 156.42 KiB)

julia> @btime $A*$C;
  77.636 μs (4 allocations: 156.41 KiB)

julia> @btime A_mul_B($A,$C);
  43.363 μs (7 allocations: 234.62 KiB)

julia> @btime A_mul_B_standard($A,$C);
  43.969 μs (7 allocations: 234.62 KiB)

julia> @benchmark Array($A)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):   5.991 μs …  1.041 ms  ┊ GC (min … max):  0.00% … 95.33%
 Time  (median):     39.353 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   46.585 μs ± 75.854 μs  ┊ GC (mean ± σ):  15.25% ±  8.92%

                             ▁▆█▆▄▂
  ▃▃▂▁▂▁▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▅███████▆▅▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂ ▃
  5.99 μs         Histogram: frequency by time        70.7 μs <

 Memory estimate: 78.20 KiB, allocs estimate: 2.

This is already surprising to me, since it looks quite different (in relative terms wrt the LinearAlgebra implementation) with your benchmarks. Do you think it depends on MKL? I have not tried it since it did not install correctly on my machine. Also, I recon that the machine I am currently working on is slow, but this should not impact the relative speed of A_mul_B wrt *.

I might try this at some point. For diagonal matrices and other “sorted” sparse matrices I have written a nice code based on @tturbo. However, it is not runnable right now, since the indexing of the innermost loop depends on the index of the previous for loop.

Elrod · August 31, 2021, 6:53pm

Yes. On Intel CPUs, for extremely simple things and for large sized dense/general matrix multiplication, OpenBLAS is pretty good (although with multithreading, MKL is much better).
For everything else, MKL is much faster.

Switching to OpenBLAS, I get (multithreaded):

julia> @btime Array($A)*$B;
  41.434 μs (4 allocations: 156.34 KiB)

julia> @btime $A*$B;
  30.012 μs (2 allocations: 78.17 KiB)

julia> @btime A_mul_B($A,$B);
  23.945 μs (4 allocations: 156.34 KiB)

julia> @btime A_mul_B_standard($A,$B);
  21.882 μs (4 allocations: 156.34 KiB)

julia> @btime $A*$C;
  52.116 μs (4 allocations: 156.34 KiB)

julia> @btime A_mul_B($A,$C);
  34.013 μs (6 allocations: 234.52 KiB)

julia> @btime A_mul_B_standard($A,$C);
  32.659 μs (6 allocations: 234.52 KiB)

single threaded:

julia> @btime Array($A)*$B;
  42.789 μs (4 allocations: 156.34 KiB)

julia> @btime $A*$B;
  37.174 μs (2 allocations: 78.17 KiB)

julia> @btime A_mul_B($A,$B);
  44.960 μs (4 allocations: 156.34 KiB)

julia> @btime A_mul_B_standard($A,$B);
  33.905 μs (4 allocations: 156.34 KiB)

julia> @btime $A*$C;
  53.058 μs (4 allocations: 156.34 KiB)

julia> @btime A_mul_B($A,$C);
  53.369 μs (6 allocations: 234.52 KiB)

julia> @btime A_mul_B_standard($A,$C);
  41.490 μs (6 allocations: 234.52 KiB)

Here A_mul_B_standard is fastest.

If you’re willing to try a beta version of Julia, it should be really easy on Julia >= 1.7.
For reference, my machine and julia version:

julia> versioninfo()
Julia Version 1.8.0-DEV.459
Commit e7c82a5bd2* (2021-08-30 21:36 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

To get a good block size, you can (for Float64):

using LoopVectorization
mr, nr = LoopVectorization.matmul_params();
mrWf64 = mr * LoopVectorization.pick_vector_width(Float64);
mrWf64, nr

it’d help to make the loops for the blocks to statically sized as well, or to at least have an upper bound on size using LoopVectorization.UpperBoundedInteger to define maximum sizes (e.g. for i in 1:LoopVectorization.UpperBoundedInteger(n, mrWf64)), which could also lead to improved code generation.

I am glacially adding support for triangular/symmetric loops to LoopVectorization. So this should come eventually.

fipelle · August 31, 2021, 8:11pm

Thank you.

I’d try it, but I am also using Julia on a university cluster and it might take too long for having it installed. I think it might be worth waiting for the stable release at this point.

This is extremely interesting. I am a big fan of LoopVectorization.jl so far. I am wrapping up a new time series package (a compendium to TSAnalysis.jl) where I am using it to speed up quite a few calculations. Having this additional support would be wonderful!