It is generally accepted that Julia is for
loops friendly, so generally no special care is required to vectorize code, because Julia will do a good job anyway. Let A and X be n x n matrices, with X symmetric. The following two functions construct the symmetric matrix Y = AX+XA’ and stores the n*(n+1)/2 elements of its upper triangular part in a columnwise compacted vectorized form.
function test1!(y,A,X)
n = size(A,1)
Y = similar(X,n,n)
mul!(Y, X, A')
mul!(Y, A, X, true, true)
k = 1
for j = 1:n
for i = 1:j
y[k] = Y[i,j]
k += 1
end
end
end
function test2!(y,A,X)
n = size(A,1)
ZERO = zero(eltype(y))
k = 1
for j = 1:n
for i = 1:j
temp = ZERO
for l = 1:n
temp += (A[i,l] * X[l,j] + X[i,l] * A[j,l])
end
y[k] = temp
k += 1
end
end
end
The function test1!
forms Y explicitly allocating it as an n x n work array and performs 2n^3 flops, while test2!
computes directly the elements of the upper triangular part, without any allocation and performing n^2(n+1) flops (about the half of previous figure).
Comparing the performance of the two functions for n = 1000, I get the following results
using LinearAlgebra
using BenchmarkTools
n = 1000
A = rand(n,n); X = hermitianpart!(rand(n,n));
y = similar(A,Int(n*(n+1)/2));
y1 = similar(A,Int(n*(n+1)/2));
julia> @btime test1!(y,$A,$X)
1.825 s (4 allocations: 7.63 MiB)
julia> @btime test2!(y1,$A,$X)
3.512 s (1 allocation: 32 bytes)
Surprisingly (for me!), test1!
is two times faster than test2!
, although I expected the oposite. The computed values agrees:
j
ulia> @time test1!(y,A,X)
1.846396 seconds (3 allocations: 7.629 MiB)
julia> @time test2!(y1,A,X)
3.509423 seconds
julia> @show norm(y-y1)/norm(y)
norm(y - y1) / norm(y) = 8.752998768126267e-16
8.752998768126267e-16
I am convinced I am overlooking something trivial, but I don’t know what.
Many thanks for your time in advance for any hint.
I am working with Julia 1.11.5 on Windows 11
julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b73 (2025-04-14 06:53 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 16 × Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
Environment:
JULIA_EDITOR = code
JULIA_VSCODE_REPL = 1