Unexpected poor FOR loops performance

I implemented a third function using your suggestion:

function test3!(y,A,X)
    n = size(A,1)
    Y = similar(X,n,n);
    mul!(Y, A, X)
    k = 1
    for j = 1:n
        for i = 1:j
            y[k] = Y[i,j]+Y[j,i]
            k += 1
        end
    end
end

With this I got

using LinearAlgebra
using BenchmarkTools
n = 1000
A = rand(n,n); X = hermitianpart!(rand(n,n));
y1 = similar(A,Int(n*(n+1)/2));
y2 = similar(A,Int(n*(n+1)/2));
y3 = similar(A,Int(n*(n+1)/2));
julia> @btime test1!(y1,$A,$X)
  1.833 s (4 allocations: 7.63 MiB)

julia> @btime test2!(y2,$A,$X)
  3.490 s (1 allocation: 32 bytes)

julia> @btime test3!(y3,$A,$X)
  7.653 ms (4 allocations: 7.63 MiB)

So, there is a speedup using test3! of about 250 times with respect to test2! and about 450 with respect to test1! . This is certainly the reference speed to be achieved by an ideal code without allocations!

The explanation for the huge speedup is the asymmetry in implementing mul! involving transposing or not

julia> Y = similar(X,n,n);

julia> @btime mul!($Y, $X, $A');
  1.840 s (0 allocations: 0 bytes)

julia> @btime mul!($Y, $X, $A);
  5.276 ms (0 allocations: 0 bytes)
2 Likes