Thanks, unfortunately I could not reproduce the time saving for small matrices:
julia> const A = randn(10,20); const A1 = randn(10,20);
julia> @benchmark A_mul_Bt(A,A)
BenchmarkTools.Trial:
memory estimate: 896 bytes
allocs estimate: 1
--------------
minimum time: 641.884 ns (0.00% GC)
median time: 653.293 ns (0.00% GC)
mean time: 701.441 ns (2.00% GC)
maximum time: 4.454 μs (67.48% GC)
--------------
samples: 10000
evals/sample: 164
julia> @benchmark A_mul_Bt(A,A1)
BenchmarkTools.Trial:
memory estimate: 896 bytes
allocs estimate: 1
--------------
minimum time: 520.005 ns (0.00% GC)
median time: 526.209 ns (0.00% GC)
mean time: 576.763 ns (2.38% GC)
maximum time: 3.825 μs (69.85% GC)
--------------
samples: 10000
evals/sample: 191
This is faster though:
julia> @benchmark Symmetric(BLAS.syrk('U', 'N', 1, A))
BenchmarkTools.Trial:
memory estimate: 928 bytes
allocs estimate: 2
--------------
minimum time: 575.412 ns (0.00% GC)
median time: 584.429 ns (0.00% GC)
mean time: 647.794 ns (3.23% GC)
maximum time: 7.375 μs (78.09% GC)
--------------
samples: 10000
evals/sample: 182
For comparison:
julia> @benchmark A * A'
BenchmarkTools.Trial:
memory estimate: 896 bytes
allocs estimate: 1
--------------
minimum time: 645.717 ns (0.00% GC)
median time: 670.588 ns (0.00% GC)
mean time: 729.945 ns (2.24% GC)
maximum time: 4.999 μs (73.14% GC)
--------------
samples: 10000
evals/sample: 159
It looks like A * A'
is essentially the same as A_mul_Bt(A,A)
Symmetric(BLAS.syrk('U', 'N', 1, A))
seems to be the fastest even for large matrices:
julia> const E = randn(5000,6000);
julia> @benchmark Symmetric(BLAS.syrk('U', 'N', 1, E))
BenchmarkTools.Trial:
memory estimate: 190.73 MiB
allocs estimate: 3
--------------
minimum time: 1.120 s (0.85% GC)
median time: 1.125 s (1.01% GC)
mean time: 1.160 s (3.11% GC)
maximum time: 1.241 s (6.34% GC)
--------------
samples: 5
evals/sample: 1
julia> @benchmark A_mul_Bt(E,E)
BenchmarkTools.Trial:
memory estimate: 190.73 MiB
allocs estimate: 2
--------------
minimum time: 1.229 s (0.78% GC)
median time: 1.281 s (2.71% GC)
mean time: 1.271 s (2.88% GC)
maximum time: 1.293 s (5.21% GC)
--------------
samples: 4
evals/sample: 1
Unfortunately, it does not allow for a weight as in A * W * A'
without using sqrt.
Do you know of a MKL function that support this?