Unexpected poor FOR loops performance

andreasvarga · June 27, 2025, 6:14pm

I implemented a third function using your suggestion:

function test3!(y,A,X)
    n = size(A,1)
    Y = similar(X,n,n);
    mul!(Y, A, X)
    k = 1
    for j = 1:n
        for i = 1:j
            y[k] = Y[i,j]+Y[j,i]
            k += 1
        end
    end
end

With this I got

using LinearAlgebra
using BenchmarkTools
n = 1000
A = rand(n,n); X = hermitianpart!(rand(n,n));
y1 = similar(A,Int(n*(n+1)/2));
y2 = similar(A,Int(n*(n+1)/2));
y3 = similar(A,Int(n*(n+1)/2));

julia> @btime test1!(y1,$A,$X)
  1.833 s (4 allocations: 7.63 MiB)

julia> @btime test2!(y2,$A,$X)
  3.490 s (1 allocation: 32 bytes)

julia> @btime test3!(y3,$A,$X)
  7.653 ms (4 allocations: 7.63 MiB)

So, there is a speedup using test3! of about 250 times with respect to test2! and about 450 with respect to test1! . This is certainly the reference speed to be achieved by an ideal code without allocations!

The explanation for the huge speedup is the asymmetry in implementing mul! involving transposing or not

julia> Y = similar(X,n,n);

julia> @btime mul!($Y, $X, $A');
  1.840 s (0 allocations: 0 bytes)

julia> @btime mul!($Y, $X, $A);
  5.276 ms (0 allocations: 0 bytes)

Topic		Replies	Views
Something faster than for loops General Usage	27	6561	May 8, 2019
Why fewer memory allocations does not necessarily suggest higher speed New to Julia performance , memory-allocation	5	846	June 6, 2021
Product of two symmetric matrices: LoopVectorization.jl vs LinearAlgebra Performance blas , linearalgebra , loopvectorization	9	1022	August 31, 2021
Why is this simple function twice as slow as its Python version Performance question	97	4888	April 12, 2021
Why mul! is so fast? General Usage question , linearalgebra	7	7021	November 26, 2019

Unexpected poor FOR loops performance

Related topics