Unexpectedly poor FOR loop performance

It is generally accepted that Julia is for-loop friendly, so usually no special care is required to vectorize code, because Julia will do a good job anyway. Let A and X be n x n matrices, with X symmetric. The following two functions construct the symmetric matrix Y = A*X + X*A' and store the n*(n+1)/2 elements of its upper triangular part in column-wise compacted (vectorized) form.

function test1!(y,A,X)
    n = size(A,1)
    Y = similar(X,n,n)
    mul!(Y, X, A')              # Y = X*A'
    mul!(Y, A, X, true, true)   # Y = A*X + Y (5-argument mul!)
    k = 1
    for j = 1:n
        for i = 1:j
            y[k] = Y[i,j]
            k += 1
        end
    end
end
function test2!(y,A,X)
    n = size(A,1)
    ZERO = zero(eltype(y))
    k = 1
    for j = 1:n
        for i = 1:j
            temp = ZERO
            for l = 1:n
                temp += (A[i,l] * X[l,j] + X[i,l] * A[j,l])   # (A*X + X*A')[i,j]
            end
            y[k] = temp
            k += 1
        end
    end
end

The function test1! forms Y explicitly, allocating it as an n x n work array, and performs about 2n^3 flops, while test2! computes the elements of the upper triangular part directly, without any allocation, performing about n^2(n+1) flops (roughly half the previous figure).
Comparing the performance of the two functions for n = 1000, I get the following results

using LinearAlgebra
using BenchmarkTools
n = 1000
A = rand(n,n); X = hermitianpart!(rand(n,n));
y = similar(A,Int(n*(n+1)/2));
y1 = similar(A,Int(n*(n+1)/2));
julia> @btime test1!(y,$A,$X)
  1.825 s (4 allocations: 7.63 MiB)

julia> @btime test2!(y1,$A,$X)
  3.512 s (1 allocation: 32 bytes)

Surprisingly (for me!), test1! is about two times faster than test2!, although I expected the opposite. The computed values agree:
julia> @time test1!(y,A,X)
  1.846396 seconds (3 allocations: 7.629 MiB)

julia> @time test2!(y1,A,X)
  3.509423 seconds

julia> @show norm(y-y1)/norm(y)
norm(y - y1) / norm(y) = 8.752998768126267e-16
8.752998768126267e-16

I am convinced I am overlooking something trivial, but I don’t know what.
Many thanks in advance for any hint.

I am working with Julia 1.11.5 on Windows 11

julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b73 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_VSCODE_REPL = 1

This is almost certainly because mul! is highly optimised, and much faster than anything you’d write by hand. Therefore, even if it does more work, it’s still overall faster.



Without involving BLAS, the timing results are as expected:

julia> n = 100
100

julia> A = rand(BigFloat,n,n); X = hermitianpart!(rand(BigFloat,n,n));

julia> y = similar(A,Int(n*(n+1)/2));

julia> y1 = similar(A,Int(n*(n+1)/2));

julia> @btime test1!(y,$A,$X)
  319.409 ms (4280004 allocations: 210.50 MiB)

julia> @btime test2!(y1,$A,$X)
  182.315 ms (4040005 allocations: 200.35 MiB)

For me, the important question is: which version of the code should I use in an iterative solver (e.g., for solving Lyapunov matrix equations): the faster one, which allocates an n x n array, or the slower one, which performs no allocation?

If you’re doing O(n^3) work, and n is sufficiently large, it’s almost always worth the O(n^2) memory allocation to call a fast BLAS routine. That being said, if you are doing this over and over again in an iterative solver, you can probably pre-allocate any arrays you need for BLAS/LAPACK.
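For example, a minimal sketch of the pre-allocated variant (the name test1_prealloc! and the extra Y argument are just illustrative, reusing the test1! approach from above):

using LinearAlgebra

function test1_prealloc!(y, Y, A, X)
    # Y is an n x n work matrix allocated once by the caller and reused,
    # so repeated calls inside a solver loop allocate nothing.
    n = size(A,1)
    mul!(Y, X, A')              # Y = X*A'
    mul!(Y, A, X, true, true)   # Y = A*X + Y
    k = 1
    for j = 1:n
        for i = 1:j
            y[k] = Y[i,j]
            k += 1
        end
    end
    return y
end

Y = similar(A, n, n)            # workspace allocated once
test1_prealloc!(y, Y, A, X)     # then called as often as needed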

PS. Note that this has nothing to do with for loops or Julia, and everything to do with the tricks that are required to make matrix–matrix multiplications fast. These kinds of optimizations can be done in Julia too (e.g. see Octavian.jl), but require a different algorithm than textbook-style 3-nested loops: Julia matrix-multiplication performance


Note that when X is Hermitian, i.e. X = X^*, it holds that AX = (XA^*)^*. Define the intermediate Z = A * X. It is then faster (by nearly 2x at non-small sizes) to compute Y = A X + X A^* as Y = Z + Z' rather than as Y = Z + X * A'. You could save a little more effort by computing only half of Y (as you discuss), but since it is only an addition the savings are extremely small.
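A quick sanity check of that identity, with a throwaway 4 x 4 example:

julia> using LinearAlgebra

julia> A = rand(4,4); X = hermitianpart!(rand(4,4));

julia> A*X ≈ (X*A')'
true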


I implemented a third function using your suggestion:

function test3!(y,A,X)
    n = size(A,1)
    Y = similar(X,n,n)
    mul!(Y, A, X)               # Y = A*X; adding Y' then gives A*X + X*A'
    k = 1
    for j = 1:n
        for i = 1:j
            y[k] = Y[i,j]+Y[j,i]
            k += 1
        end
    end
end

With this I got

using LinearAlgebra
using BenchmarkTools
n = 1000
A = rand(n,n); X = hermitianpart!(rand(n,n));
y1 = similar(A,Int(n*(n+1)/2));
y2 = similar(A,Int(n*(n+1)/2));
y3 = similar(A,Int(n*(n+1)/2));
julia> @btime test1!(y1,$A,$X)
  1.833 s (4 allocations: 7.63 MiB)

julia> @btime test2!(y2,$A,$X)
  3.490 s (1 allocation: 32 bytes)

julia> @btime test3!(y3,$A,$X)
  7.653 ms (4 allocations: 7.63 MiB)

So there is a speedup using test3! of about 450 times with respect to test2! and about 240 times with respect to test1!. This is certainly the reference speed to be targeted by an ideal code without allocations!

The explanation for the huge speedup is an asymmetry in the mul! implementation, depending on whether the second argument is an adjoint or not:

julia> Y = similar(X,n,n);

julia> @btime mul!($Y, $X, $A');
  1.840 s (0 allocations: 0 bytes)

julia> @btime mul!($Y, $X, $A);
  5.276 ms (0 allocations: 0 bytes)

I’m seeing the same thing. That’s definitely not how it should be. The issue is almost certainly that dispatch is not picking the best method for that argument pair (a Hermitian matrix times an adjoint). Note that if you first copy X to a plain Matrix (despite the added cost of the copy) you should get very close to the original performance.

julia> @btime mul!($Y, Matrix($X), $A');
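For completeness, a sketch of how that workaround could be folded into test1!, with the dense copy pre-allocated so it is not paid as a fresh allocation on every call (the name test1_dense! and the Xd argument are just illustrative, assuming X is the Hermitian wrapper returned by hermitianpart!):

function test1_dense!(y, Y, Xd, A, X)
    n = size(A,1)
    copyto!(Xd, X)              # densify the Hermitian wrapper (an O(n^2) copy)
    mul!(Y, Xd, A')             # plain Matrix times adjoint hits the fast BLAS path
    mul!(Y, A, Xd, true, true)  # Y = A*X + Y
    k = 1
    for j = 1:n
        for i = 1:j
            y[k] = Y[i,j]
            k += 1
        end
    end
    return y
end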

Opened an issue: Low performance for Hermitian-times-adjoint matrix multiply · Issue #1393 · JuliaLang/LinearAlgebra.jl · GitHub.
