Why are for loops so slow?

I tried to write simple code for solving a system of equations using Gaussian elimination. The coefficient matrix is symmetric. This code runs more than 16 times slower than the built-in x = A\b solver. I’ve tried almost everything (LoopVectorization, Threads, @inbounds, @simd, Tullio, etc.) without any effect. A factor of 16 difference in speed seems very high. What might be the problem?

using LinearAlgebra
using BenchmarkTools

# Creating system of linear equation with n unknowns
# Let the coefficient matrix symmetric

n = 1000
A = floor.(10 * rand(n,n))
A0::Matrix{Float64} = A + transpose(A)
x0::Vector{Float64} = floor.(10 * rand(n))
b = A0 * x0

#A0=[1. 3 2; 3 2 1; 2 1 3]
#b=[13.; 10; 13]

# Method 1: Gauss elimination on a dense matrix
x1 = A0 \ b
@btime $A0 \ $b

# Method 2: Let's take advantage of the fact that the coefficient matrix is symmetric
A2 = copy(A0)
b2 = copy(b)
x2 = similar(b2) # Is it needed?

function symgauss!(A::Matrix{Float64}, b::Vector{Float64})
    @assert size(A, 1) == size(A, 2) && size(A, 1) == length(b) "Size mismatch"
    # Gauss elimination
    n = length(b)
    d::Float64 = 0
    e::Float64 = 0
    for k ∈ 2:n
        e = A[k-1, k-1]
        for i ∈ k:n
            d = A[k-1, i] / e
            for j ∈ i:n
                A[i, j] -= A[k-1, j] * d
            end
            #A[i, i:n] .-= A[k-1, i:n] * d
            b[i] -= b[k-1] * d
        end
    end
    # Back substitution
    x = similar(b)
    c::Float64 = 0
    for i ∈ n:-1:1
        c = 0
        for j ∈ n:-1:i+1
            c += A[i, j] * x[j]
        end
        x[i] = (b[i] - c) / A[i, i]
    end
    return x
end

x2 = symgauss!(A2, b2)
A2 = copy(A0)
@btime symgauss!($A2, $b2)

println(sum(abs2, x1-x2))

with results:

 16.270 ms (4 allocations: 7.64 MiB)
  269.603 ms (1 allocation: 7.94 KiB)
5.575447398718873e-17

I suspect it’s much more likely that you’re simply using a slower algorithm than any problem with inefficient for loops.

3 Likes

You can always check what method is being called with @which, or inspect the method’s source code in your favourite editor (@edit) or in the REPL (@less):

julia> @less A0 \ b
function (\)(A::AbstractMatrix, B::AbstractVecOrMat)
    require_one_based_indexing(A, B)
    m, n = size(A)
    if m == n
        if istril(A)
            if istriu(A)
                return Diagonal(A) \ B
            else
                return LowerTriangular(A) \ B
            end
        end
        if istriu(A)
            return UpperTriangular(A) \ B
        end
        return lu(A) \ B
    end
    return qr(A, ColumnNorm()) \ B
end

So in your case, for square A, it does an LU factorization and then calls a specialized \ method for the resulting factorization:

julia> @less lu(A0) \ b
function (\)(F::Union{<:LAPACKFactorizations,Adjoint{<:Any,<:LAPACKFactorizations}}, B::AbstractVecOrMat)
(...)

and you can follow the trail from there if you want to see the algorithm (although I suppose it ends up in LAPACK?)
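
If you want to confirm which backend library actually ends up doing the work, a quick check (a minimal sketch; the exact output depends on your Julia build, which ships OpenBLAS by default) is:

using LinearAlgebra

BLAS.get_config()   # lists the BLAS/LAPACK libraries currently loaded (Julia 1.7+)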

1 Like

Writing an efficient Gaussian elimination routine isn’t quite that simple, and some of the optimizations done in LAPACK, which is what \ eventually calls in this case, bring pretty big benefits. (The big win is formulating a block variant of the algorithm and then calling out to level-3 BLAS.) Beyond that, however, there is one easy issue with your code: the innermost loop runs across a row of A (the column index j varies while the row index i stays fixed), and since Julia stores matrices in column-major order, that access pattern is slower than running down a column. If you store the multipliers in the lower triangular part of A, you can fix it as follows:

function symgauss!(A::Matrix{Float64}, b::Vector{Float64})
    @assert size(A, 1) == size(A, 2) && size(A, 1) == length(b) "Size mismatch"
    # Gauss elimination
    n = length(b)
    e::Float64 = 0
    for k ∈ 2:n
        e = A[k-1, k-1]
        for i ∈ k:n
            A[i, k-1] = A[k-1, i] / e    # store the multiplier in the lower triangle
            b[i] -= b[k-1] * A[i, k-1]
        end
        for j ∈ k:n
            for i ∈ k:j                  # innermost loop now runs down column j
                A[i, j] -= A[k-1, j] * A[i, k-1]
            end
        end
    end
    # Back substitution
    x = similar(b)
    c::Float64 = 0
    for i ∈ n:-1:1
        c = 0
        for j ∈ n:-1:i+1
            c += A[i, j] * x[j]
        end
        x[i] = (b[i] - c) / A[i, i]
    end
    return x
end

This still exploits symmetry by skipping the Schur complement updates in the lower triangular part of A, although it does overwrite both the lower and upper triangles (the lower triangle now holds the multipliers). It runs in about half the time on my system. The rest of the difference is solidly within the normal range of the benefit from using level-3 BLAS in a blocked version of the algorithm.
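
To give a flavor of what “blocked plus level-3 BLAS” means, here is a rough sketch of a blocked (unpivoted) Cholesky factorization. It is only an illustration, written for the symmetric positive definite case so that skipping pivoting is safe; the function name and block size are arbitrary choices, not anything from LAPACK:

using LinearAlgebra

function blocked_cholesky!(A::Matrix{Float64}, nb::Int = 64)
    n = size(A, 1)
    for k in 1:nb:n
        kb = min(nb, n - k + 1)
        K = k:k+kb-1              # current block columns
        R = k+kb:n                # trailing submatrix
        # 1. Unblocked factorization of the small diagonal block: A[K,K] = L11*L11'
        L11 = cholesky!(Symmetric(view(A, K, K), :L)).L
        if !isempty(R)
            # 2. Panel: triangular solve A[R,K] <- A[R,K] / L11' = L21
            rdiv!(view(A, R, K), L11')
            # 3. Trailing update as one big matrix multiply (this level-3 call
            #    is where almost all of the runtime goes): A[R,R] -= L21 * L21'
            mul!(view(A, R, R), view(A, R, K), view(A, R, K)', -1.0, 1.0)
        end
    end
    return A   # the lower triangle of A now holds the Cholesky factor L
end

# Quick sanity check on an SPD test matrix:
n = 1000
M = randn(n, n); S = M * M' + n * I
L = LowerTriangular(blocked_cholesky!(copy(S)))
norm(L * L' - S) / norm(S)   # should be around machine precision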

It’s worth noting that without pivoting, this algorithm is unstable without some additional assumption like positive definiteness or diagonal dominance. It also fails with zero pivots sometimes because of the way you generated A to contain integer values.

Edit: Assuming this is for educational purposes, writing this sort of code can be instructive, but you don’t want to use your own Gaussian elimination in a practical setting unless you are willing to put a lot more work into it. Beyond the lack of pivoting, you would normally return a factorization instead of doing the elimination on b (an LDLᵀ factorization for symmetric indefinite A) and then handle the forward/backward solves separately.
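
As a sketch of that factor-then-solve workflow (using Julia’s built-in bunchkaufman for the dense symmetric indefinite case, which is one standard LDLᵀ-style choice; the variables assume the matrices defined earlier in the thread):

using LinearAlgebra

F = bunchkaufman(Symmetric(A0))   # O(n^3) factorization, done once
y = F \ b                         # only forward/backward solves, O(n^2)
y2 = F \ (2 .* b)                 # the factorization can be reused for new right-hand sides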

4 Likes

My aim is to develop simple code for the skyline technique (similar to this). The code above is the first step. I would like to understand each step of the skyline technique and build a faster equation solver than the one built into Julia. The coefficient matrix comes from FEM, so it is always symmetric and positive definite; no preliminary examination is needed.

To elaborate on @mstewart’s answer above, even for the simpler problem of multiplying two m×m matrices, which naively takes about 10 lines of code (3 nested loops), the simplest implementations (in any language!) are typically orders of magnitude slower than highly optimized algorithms (which do the “same” ~2m³ floating-point operations; the theoretically lower-complexity matrix-multiplication algorithms are virtually never used). Optimized matrix-multiply “BLAS” libraries often devote upwards of 10,000 lines of code to this problem! It’s not a question of the speed of “for loops”: optimized code completely re-organizes the algorithm into “block” operations that have better memory locality, for example.
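
For a concrete feel for that gap, here is a quick benchmark sketch; matmul_naive! is just an illustrative helper (not anything from the thread), written with the column-major-friendly j, k, i loop order, and it is still far behind the BLAS-backed mul!:

using LinearAlgebra, BenchmarkTools

function matmul_naive!(C, A, B)
    m, p = size(A); n = size(B, 2)
    fill!(C, zero(eltype(C)))
    @inbounds for j in 1:n, k in 1:p, i in 1:m
        C[i, j] += A[i, k] * B[k, j]   # accumulate down columns of C and A
    end
    return C
end

m = 1000
X, Y, Z = rand(m, m), rand(m, m), zeros(m, m)
@btime matmul_naive!($Z, $X, $Y)   # plain triple loop
@btime mul!($Z, $X, $Y)            # optimized BLAS gemm, typically many times faster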

See also the discussion thread: Julia matrix-multiplication performance

It is very hard to beat (or even come close to) the performance of highly optimized libraries for basic operations on generic dense matrices, even if you understand these performance issues! This is true in any language, even in compiled languages like C. (Where you can do better is if your matrices are very special and you can exploit that with improved algorithms.)

Note, by the way, that all of these type declarations are not required for performance. As long as you initialize your variables in a type-stable way like d = zero(eltype(b)); e = zero(eltype(A)), you can omit all of the type declarations and the code will be just as fast … and more generic, since it will work on matrices of any numeric type. (This is a common misconception.) See “Argument-Type Declarations” in the Julia manual.
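
Here is a sketch of what that looks like applied to the improved routine above (the same algorithm with the concrete-type annotations dropped; it stays type-stable and therefore just as fast, and now also accepts Float32, BigFloat, etc.):

function symgauss_generic!(A::AbstractMatrix, b::AbstractVector)
    n = length(b)
    @assert size(A) == (n, n) "Size mismatch"
    for k in 2:n
        e = A[k-1, k-1]
        for i in k:n
            A[i, k-1] = A[k-1, i] / e      # multiplier; its type is inferred from eltype(A)
            b[i] -= b[k-1] * A[i, k-1]
        end
        for j in k:n, i in k:j
            A[i, j] -= A[k-1, j] * A[i, k-1]
        end
    end
    x = similar(b)
    for i in n:-1:1
        c = zero(eltype(b))                # zero(eltype(b)) instead of c::Float64 = 0
        for j in i+1:n
            c += A[i, j] * x[j]
        end
        x[i] = (b[i] - c) / A[i, i]
    end
    return x
end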

6 Likes

Thanks for the answers. I think the skyline technique is better suited to solving systems of equations that come from FEM, which is why it should be faster than a general-purpose solver. Or is a well-optimized general-purpose solver faster than simple code for a special (and in principle faster) algorithm? I would like to decide which one to use for a complex problem. The aim is speed. Is there perhaps a solver in Julia designed especially for FEM problems?

BLAS will use threads efficiently; what have you done to try to exploit that?
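
For reference, the BLAS thread count can be queried and set directly; something like this (the count 4 is just an example) lets you compare single- and multi-threaded timings:

using LinearAlgebra

BLAS.get_num_threads()    # how many threads BLAS is currently using
BLAS.set_num_threads(4)   # e.g. force 4 threads, then re-run the \ benchmark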

Writing a competitive dense solver is hard, as @stevengj noted in his response. Writing a competitive direct sparse solver is even harder. The best solvers are written by specialists who have devoted years of their lives to that sort of work. I believe the Cholesky used by Julia for a sparse matrix is the supernodal Cholesky from SuiteSparse mentioned in the README of the SkylineSolvers.jl package you linked to. If your system has useful sparsity structure, the built-in sparse Cholesky is going to be much faster than the \ on a dense matrix that you have been benchmarking. You would be better off generating a more realistic test case for your problem, making sure it is stored as a sparse matrix, and then using the built-in cholesky. You aren’t likely to get close to that with your own code.

If you really want to try to implement something for educational purposes (which I think is a great thing to do, by the way, assuming you have realistic expectations going in), Direct Methods for Sparse Linear Systems would be a very good place to start reading.

Edit: A starting place for comparison would be

using SparseArrays, LinearAlgebra, BenchmarkTools

n = 30
# Standard 5-point discrete Laplacian on an n-by-n grid (sparse, SPD)
B = spdiagm(-1 => fill(-1.0, n - 1), 1 => fill(-1.0, n - 1), 0 => fill(4.0, n))
A = kron(spdiagm(0 => fill(1.0, n)), B) +
    spdiagm(-n => fill(-1.0, n^2 - n), n => fill(-1.0, n^2 - n))
F = cholesky(A)      # supernodal sparse Cholesky (CHOLMOD)
@btime cholesky($A)
b = randn(n^2)
@btime $F \ $b

Where, of course, you would run it on a real FEM problem.

1 Like

FEM matrices are sparse. To take advantage of this, you must use some kind of sparse-matrix data structure (i.e., one that only stores the nonzero entries). As @mstewart noted, writing an efficient sparse solver is also quite difficult, and is completely different from writing dense solvers.
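
As a minimal illustration of that storage difference (using a 1-D Laplacian as a stand-in for an FEM stiffness matrix; purely an example):

using SparseArrays

n = 10_000
S = spdiagm(-1 => fill(-1.0, n - 1), 0 => fill(2.0, n), 1 => fill(-1.0, n - 1))
nnz(S)                 # about 3n stored nonzeros
Base.summarysize(S)    # a few hundred kB, versus 8n^2 ≈ 800 MB for Matrix(S)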

Julia comes with a set of sparse-matrix solvers from SuiteSparse, e.g. UMFPACK (sparse LU) and CHOLMOD (sparse Cholesky, for SPD sparse matrices). There are also several other libraries you can load, such as MUMPS.jl and Pardiso.jl, as well as completely different approaches like Krylov iterative methods. Unless you are a specialist with a lot of time, you are very unlikely to beat the performance of these sparse-direct libraries for an FEM matrix.

So, I would approach this only as a learning exercise, but without high expectations for the performance of your code vs. optimized linear-algebra libraries.