Hi,

I am playing around with LoopVectorization.jl. In doing so, I have noticed that it seems to be about 1.5x faster than LinearAlgebra in computing the product between two symmetric matrices (while being about as fast in computing the product between symmetric and non-symmetric matrices). Is this expected behaviour?

I have written down a small example to highlight this point.

```
# Packages
using BenchmarkTools, LinearAlgebra, LoopVectorization, Random;
@inline function A_mul_B(A::AbstractMatrix{T}, B::AbstractMatrix{T}) where T<:Real
# Memory pre-allocation
C = zeros(size(A, 1), size(B, 2));
# Compute matrix product
@tturbo for j in axes(B, 2)
for i in axes(A, 1)
C_ij = zero(T)
for k in axes(A, 2)
C_ij += A[k,i]*B[k,j];
end
C[i,j] = C_ij;
end
end
return C;
end
A_mul_B(A::Symmetric{T, Matrix{T}}, B::AbstractMatrix{T}) where T<:Real = A_mul_B(Array(A), B);
A_mul_B(A::Symmetric{T, Matrix{T}}, B::Symmetric{T, Matrix{T}}) where T<:Real = A_mul_B(Array(A), Array(B));
```

The benchmarks below show that the advantage is visible only with `A*C`

.

```
Random.seed!(1);
A = Symmetric(randn(100,100));
B = randn(100,100);
C = Symmetric(randn(100,100));
# Symmetric times non-symmetric
@btime Array($A)*$B;
@btime $A*$B;
@btime A_mul_B($A,$B);
# Symmetric times symmetric
@btime $A*$C;
@btime A_mul_B($A,$C);
```

In my testing environment both `Threads.nthreads()`

and `BLAS.get_num_threads()`

are equal to 4. `Array(...)`

slow things down a bit, but I could not find a better option to get the symmetric matrix in array format.