Why is matrix times matrix transpose not a Symmetric / Hermitian matrix


I often write code like A * A', where A is either a real or a complex matrix. Obviously the outcome is a symmetric or Hermitian matrix, respectively.

I wondered why the result is not returned as a Symmetric matrix, e.g.

julia> A = randn(3,4);
julia> B = A * A'
3×3 Array{Float64,2}:
  0.813621   0.716845  -1.26026
  0.716845   4.08373   -1.93742
 -1.26026   -1.93742    2.33786

as opposed to

julia> B = Symmetric(A*A')
3×3 Symmetric{Float64,Array{Float64,2}}:
  0.813621   0.716845  -1.26026
  0.716845   4.08373   -1.93742
 -1.26026   -1.93742    2.33786

Symmetric would have the advantage that a subsequent eig(B) or chol(B) could be calculated more efficiently.
Does A * A' actually calculate every entry, or is it clever enough to only calculate the upper or lower triangle?
If so, can it cope with expressions like this too:

C = (Y - y) * W * (Y - y)'

where Y is a matrix, W is a diagonal matrix, and y is a vector. The outcome C is also a symmetric / Hermitian matrix.


This requires a lazy transpose, so that multiplication can dispatch on the fact that the right-hand matrix is the adjoint of the left-hand one. That couldn’t be done in v0.6, but a check for equality followed by wrapping the result in Symmetric could be an interesting dispatch in v0.7/v1.0?
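For illustration, a minimal sketch of what the lazy wrapper looks like on v0.7/v1.0 (names from the LinearAlgebra stdlib):

```julia
using LinearAlgebra

A  = randn(3, 4)
At = A'               # lazy Adjoint wrapper, no copy is made
At isa Adjoint        # true: dispatch can see the structure
At.parent === A       # true: the original matrix is recoverable
B = A * A'            # so * can branch on parent identity (see matmul.jl below)
```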


Is there a function that calculates A * A' efficiently as a Symmetric matrix?
Something like:

B = multiply_with_its_transpose(A)

or if I’d like a weight W too:

B = multiply_with_its_transpose(A, W)


You could look at using the dsyrk routine from the BLAS wrapper.

This is a pretty relevant article: https://software.intel.com/en-us/articles/a-matrix-multiplication-routine-that-updates-only-the-upper-or-lower-triangular-part-of-the
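For concreteness, a hedged sketch of driving syrk directly from Julia (v0.7+ names; on v0.6 the call is the same without the using line):

```julia
using LinearAlgebra

A = randn(4, 6)
# syrk fills only one triangle of alpha * A * A'; the other triangle is left
# unspecified, so wrap the result to read from the computed triangle
B = Symmetric(BLAS.syrk('U', 'N', 1.0, A), :U)
B ≈ A * A'   # true
```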


You can check (here on 0.6 / OpenBLAS):

r=rand(5000,5000); rr=rand(5000,5000); A_mul_Bt(r,r); A_mul_Bt(r,rr);
@time A_mul_Bt(r,r);
#  3.441910 seconds (85 allocations: 190.741 MiB, 0.56% gc time)
@time A_mul_Bt(r,rr);
#  6.626233 seconds (6 allocations: 190.735 MiB, 1.31% gc time)

And let’s make sure that this is not due to accidentally better L1 cache use or something:

rr3=unsafe_wrap(Array, pointer(r), size(r));
@time A_mul_Bt(r,rr3);
#  6.580791 seconds (6 allocations: 190.735 MiB, 0.40% gc time)


That’s extremely nice! The magic happens in matmul.jl (in v0.7, but it also works in 0.6):

mul!(C::StridedMatrix{T}, transA::Transpose{<:Any,<:StridedVecOrMat{T}}, B::StridedVecOrMat{T}) where {T<:BlasFloat} =
    (A = transA.parent; A===B ? syrk_wrapper!(C, 'T', A) : gemm_wrapper!(C, 'T', 'N', A, B))

For A'*B*A I don’t think it’s in plain BLAS, so you have to use the MKL extensions manually, or find another trick (for instance, if B is diagonal you can compute sqrt(B)*A and then use syrk).
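The diagonal trick, sketched for the C = (Y - y) * W * (Y - y)' example from the question (assuming the weights are nonnegative so the square root is real):

```julia
using LinearAlgebra

Y = randn(3, 5); y = randn(3); w = rand(5)      # w: nonnegative weights
Z = (Y .- y) .* sqrt.(w)'                       # scale column i by sqrt(w[i])
C = Symmetric(BLAS.syrk('U', 'N', 1.0, Z), :U)  # Z*Z' == (Y .- y)*Diagonal(w)*(Y .- y)'
```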

It does not produce a Symmetric automatically because that would not be type stable: just do Symmetric(A*A'); there’s no further allocation.
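One way to see that the wrapper allocates nothing (data is an internal field of Symmetric, inspected here only for illustration):

```julia
using LinearAlgebra

A = randn(3, 4)
B = A * A'
S = Symmetric(B)   # just a wrapper around B, no copy
S.data === B       # true: the very same array underneath
```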


Thanks, unfortunately I could not reproduce the time saving for small matrices:

julia> const A = randn(10,20); const A1 = randn(10,20);
julia> @benchmark A_mul_Bt(A,A)
  memory estimate:  896 bytes
  allocs estimate:  1
  minimum time:     641.884 ns (0.00% GC)
  median time:      653.293 ns (0.00% GC)
  mean time:        701.441 ns (2.00% GC)
  maximum time:     4.454 μs (67.48% GC)
  samples:          10000
  evals/sample:     164

julia> @benchmark A_mul_Bt(A,A1)
  memory estimate:  896 bytes
  allocs estimate:  1
  minimum time:     520.005 ns (0.00% GC)
  median time:      526.209 ns (0.00% GC)
  mean time:        576.763 ns (2.38% GC)
  maximum time:     3.825 μs (69.85% GC)
  samples:          10000
  evals/sample:     191

This is faster though:

julia> @benchmark Symmetric(BLAS.syrk('U', 'N', 1, A))
  memory estimate:  928 bytes
  allocs estimate:  2
  minimum time:     575.412 ns (0.00% GC)
  median time:      584.429 ns (0.00% GC)
  mean time:        647.794 ns (3.23% GC)
  maximum time:     7.375 μs (78.09% GC)
  samples:          10000
  evals/sample:     182

For comparison:

julia> @benchmark A * A'
  memory estimate:  896 bytes
  allocs estimate:  1
  minimum time:     645.717 ns (0.00% GC)
  median time:      670.588 ns (0.00% GC)
  mean time:        729.945 ns (2.24% GC)
  maximum time:     4.999 μs (73.14% GC)
  samples:          10000
  evals/sample:     159

It looks like A * A' is essentially the same as A_mul_Bt(A,A).

Symmetric(BLAS.syrk('U', 'N', 1, A)) seems to be the fastest even for large matrices:

julia> const E = randn(5000,6000);

julia> @benchmark Symmetric(BLAS.syrk('U', 'N', 1, E))
  memory estimate:  190.73 MiB
  allocs estimate:  3
  minimum time:     1.120 s (0.85% GC)
  median time:      1.125 s (1.01% GC)
  mean time:        1.160 s (3.11% GC)
  maximum time:     1.241 s (6.34% GC)
  samples:          5
  evals/sample:     1

julia> @benchmark A_mul_Bt(E,E)
  memory estimate:  190.73 MiB
  allocs estimate:  2
  minimum time:     1.229 s (0.78% GC)
  median time:      1.281 s (2.71% GC)
  mean time:        1.271 s (2.88% GC)
  maximum time:     1.293 s (5.21% GC)
  samples:          4
  evals/sample:     1

Unfortunately, it does not allow for a weight as in A * W * A' without using sqrt.

Do you know of an MKL function that supports this?


Resurrecting this topic: it would be nice to have a high-level function for A'*A and A*A' that does the “right thing”, e.g. the result is Symmetric for generic A and Diagonal for A::Diagonal.
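A sketch of what such a function could look like (self_outer is a made-up name, not an existing package API; real element types only, for simplicity):

```julia
using LinearAlgebra

# Hypothetical helper: compute X * X' with a structure-preserving return type.
self_outer(A::AbstractMatrix{<:Real}) = Symmetric(A * A')
self_outer(D::Diagonal{<:Real})       = Diagonal(abs2.(D.diag))  # stays Diagonal

self_outer(randn(3, 4)) isa Symmetric           # true
self_outer(Diagonal([1.0, 2.0])) isa Diagonal   # true
```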

I am using this so often that if this functionality is not available, I would do a mini-package, but thought I would check first.


Yes please! I think overloading * is enough?


Nope, that would be type piracy.

Package (for v0.7) is here, comments welcome:


Ah sorry, I forgot my previous comment… It would be great to have this done automatically, but it can’t because it wouldn’t be type-stable. So yes, a package is the next best thing.


I think a dedicated function is actually better, since you don’t have to repeat the argument. E.g. with the package above,


as opposed to

A = from_my(complicated_calculation)
A * A'

or

from_my(complicated_calculation) |> X -> X*X'