I did some searching but it seems that Julia doesn’t implement a specific algorithm for block tridiagonal matrices. So I implemented one myself. I would like to know:
Have I missed some existing algorithm implementations for block tridiagonal matrices?
Is there any room for improvement in the solver function I wrote?
function block_thomas_tridiagonal(A::AbstractVector{T}, B::AbstractVector{T}, C::AbstractVector{T},
                                  d::AbstractVector) where {T}
    B = copy(B)
    n = size(B[1], 1) # rows of each block
    m = length(B)     # number of blocks
    # Forward elimination
    for i = 2:m # (no speed up using inbounds+simd)
        multiplier = A[i-1] * inv(B[i-1])
        B[i] -= multiplier * C[i-1]
        d[n*(i-1)+1:n*i] .-= multiplier * d[n*(i-2)+1:n*(i-1)]
    end
    # Backward substitution
    x = similar(d)
    x[n*(m-1)+1:n*m] = B[m] \ d[n*(m-1)+1:n*m] # no speed up using .=
    for i = m-1:-1:1
        x[n*(i-1)+1:n*i] = B[i] \ (d[n*(i-1)+1:n*i] - C[i] * x[n*i+1:n*(i+1)])
    end
    return x
end
This function is specifically designed to handle the case where every block has the same size. It usually takes a Vector{Matrix{Float64}} as input and solves each block one by one.
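For readers who want to try it, here is a minimal usage sketch with made-up sizes; the diagonal blocks are shifted by a multiple of the identity only to keep this random example well conditioned, and copy(d) is passed because the function updates the right-hand side in place.

using LinearAlgebra

n, m = 4, 6                                     # block size and number of blocks (arbitrary)
A = [rand(n, n) for _ = 1:m-1]                  # sub-diagonal blocks
B = [rand(n, n) + 2n * I for _ = 1:m]           # diagonal blocks, shifted to stay well conditioned
C = [rand(n, n) for _ = 1:m-1]                  # super-diagonal blocks
d = rand(n * m)
x = block_thomas_tridiagonal(A, B, C, copy(d))  # copy(d) because d is modified in place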
This parallel (thread+SIMD) C++ implementation may help you write a performant Julia implementation of the Thomas algorithm. Be aware that for large problems exceeding the cache size, the algorithm is memory bound.
Don’t use explicit matrix inversion. You could use A[i-1] / B[i-1] instead. If the blocks are all the same size, as you seem to be assuming here, you could preallocate buffers for the LU factorization, the multipliers, and the other matrix multiplications.
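To make that suggestion concrete, here is a small hedged sketch, with hypothetical stand-ins Ai and Bi for A[i-1] and B[i-1], of computing the multiplier without an explicit inverse, either via right division or via a reusable LU factorization.

using LinearAlgebra

Ai, Bi = rand(4, 4), rand(4, 4) + 4I   # stand-ins for A[i-1] and B[i-1]
multiplier = Ai / Bi                   # right division: no explicit inverse is formed
# or, reusing preallocated storage across iterations:
F = lu!(copy(Bi))                      # factor the diagonal block once
rdiv!(Ai, F)                           # overwrite Ai with Ai / Bi in place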
That’s because .= is for elementwise/broadcasting operations, not matrix operations. If you want to do the latter in-place, use ldiv! and mul! etc.
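A tiny hedged illustration of that point: Y .= K * X still allocates the product before broadcasting it into Y, whereas mul! and ldiv! write the result directly into a preallocated output.

using LinearAlgebra

K = rand(3, 3); X = rand(3, 3); Y = similar(X)
mul!(Y, K, X)        # Y = K * X with no temporary array
F = lu(K)
ldiv!(Y, F, X)       # Y = K \ X, reusing the factorization F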
Beware that this algorithm may be numerically unstable, IIRC, e.g. if any of the blocks are nearly singular.
It is, unfortunately. It’s even unstable for a non-block symmetric tridiagonal matrix. (And the block case is going to be worse with the explicit inverse you pointed out.) You really need some form of pivoting to guarantee stability. I put an example for the factorization variant used in ldlt in a post last year.
I just showed the factorization backward error, but the thing that seems more troubling to me is that the symmetric tridiagonal solver in Julia uses an unstable algorithm. With a relative residual for the solver, the example is:
LAPACK avoids this by using LU with partial pivoting for its tridiagonal solver. However, in addition to the permutation, you pick up an extra superdiagonal in U, so it isn’t a fully in-place algorithm like the elegant algorithm implemented in ldlt!. The alternative is Bunch’s method, which uses 2×2 diagonal blocks as mentioned in my other post. You don’t need any permutations, but you do need to accept 2×2 blocks in D. Bunch’s algorithm has the further advantage of preserving symmetry, so by finding the eigenvalues in D you can reliably compute the inertia of a matrix. However, it also requires extra storage, and you can’t represent the triangular factor as a UnitLowerTriangular and D as a SymTridiagonal as is done now. The structure of the factors doesn’t align as nicely with existing types in LinearAlgebra, which is annoying even if you are fine giving up on a fully in-place factorization.
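For reference, here is one hedged way to compute the kind of relative residual mentioned above (this is my formulation; the exact definition used in the earlier post may differ slightly).

using LinearAlgebra

relative_residual(M, x, d) = norm(d - M * x) / (norm(M) * norm(x))

M = rand(8, 8) + 8I; d = rand(8)
relative_residual(M, M \ d, d)   # a backward-stable solve gives a value near machine epsilon here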
Thanks for the link. But I took a closer look and there is no implementation of Thomas’ algorithm for block tridiagonal matrices there, only the ordinary Thomas algorithm. It did help me improve the Thomas algorithm I originally wrote, though, thank you.
Thank you for the reply. I spent a little time on how to improve this algorithm based on your suggestion. Here is a comparison between the original and the improved version
original
function block_thomas_tridiagonal(L::AbstractVector{T}, D::AbstractVector{T}, U::AbstractVector{T},
                                  b::AbstractVector) where {T}
    D = deepcopy(D)
    n = size(D[1], 1) # each block has size n*n
    m = length(D)     # num of blocks
    # Forward elimination
    for i = 1:m-1
        multiplier = L[i] * inv(D[i])
        D[i+1] -= multiplier * U[i]
        b[n*i+1:n*(i+1)] .-= multiplier * b[n*(i-1)+1:n*i]
    end
    # Backward substitution
    x = similar(b)
    x[n*(m-1)+1:n*m] = D[m] \ b[n*(m-1)+1:n*m]
    for i = m-1:-1:1
        x[n*(i-1)+1:n*i] = D[i] \ (b[n*(i-1)+1:n*i] - U[i] * x[n*i+1:n*(i+1)])
    end
    return x
end
and improved
function block_thomas_tridiagonal!(x::AbstractVector, L::AbstractVector{T}, D::AbstractVector{T},
                                   U::AbstractVector{T}, b::AbstractVector) where {T}
    n = size(D[1], 1) # each block has size n*n
    m = length(D)     # num of blocks
    (; lu_buf, D_buf) = get_block_thomas_buffer(n, m)
    # Forward elimination
    for i = 1:m-1
        lu_buf[i] = lu!(D[i])
        rdiv!(L[i], lu_buf[i])
        D[i+1] .-= mul!(D_buf[i], L[i], U[i])
        b[n*i+1:n*(i+1)] .-= L[i] * b[n*(i-1)+1:n*i]
    end
    # Backward substitution
    ldiv!(@view(x[n*(m-1)+1:n*m]), lu!(D[m]), b[n*(m-1)+1:n*m])
    for i = m-1:-1:1
        ldiv!(@view(x[n*(i-1)+1:n*i]), lu_buf[i], (b[n*(i-1)+1:n*i] - U[i] * x[n*i+1:n*(i+1)]))
    end
    return x
end
where the buffer is a vector of structures I wrote myself
mutable struct BlockThomas
    n::Int
    m::Int
    lu_buf::Vector{LU{Float64,Matrix{Float64},Vector{Int64}}}
    D_buf::Vector{Matrix{Float64}}
    function BlockThomas(n, m)
        lu_buf = Vector{LU{Float64,Matrix{Float64},Vector{Int64}}}(undef, m)
        D_buf = [Matrix{Float64}(undef, n, n) for _ = 1:m]
        return new(n, m, lu_buf, D_buf)
    end
end
const BlockThomasBuf = Vector{BlockThomas}(undef, nthreads())
function get_block_thomas_buffer(n::Int, m::Int)
    tid = threadid()
    return if !isassigned(BlockThomasBuf, tid)
        BlockThomasBuf[tid] = BlockThomas(n, m)
    elseif BlockThomasBuf[tid].n != n || BlockThomasBuf[tid].m != m
        BlockThomasBuf[tid] = BlockThomas(n, m)
    else
        BlockThomasBuf[tid]
    end
end
The average benchmark time went from 460.868 μs (memory estimate: 692.58 KiB, allocs estimate: 1893) to 325.692 μs (memory estimate: 114.73 KiB, allocs estimate: 893), a 29% time savings on this medium-sized example.
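As a hedged sketch of how such timings might be reproduced (the benchmarking code itself is not shown above): since the in-place solver overwrites D and b, each sample needs fresh copies, e.g. via BenchmarkTools with a setup block and evals = 1. The names L_blocks, D_blocks, U_blocks, and rhs below are hypothetical placeholders for the block arrays and right-hand side.

using BenchmarkTools

@benchmark block_thomas_tridiagonal!(x, Lc, Dc, Uc, bc) setup = (
        Lc = deepcopy(L_blocks); Dc = deepcopy(D_blocks); Uc = deepcopy(U_blocks);
        bc = copy(rhs); x = similar(rhs)) evals = 1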
For the benefit of anyone who sees this post later, I’ve put both pieces of code in my repository.
The improved version does not explicitly use matrix inversion, but the two seem to be about the same in accuracy.
using LinearAlgebra, Random

Random.seed!(1)
n = 10
m = 100
A = [rand(BigFloat, n, n) for _ = 1:m-1]
B = [rand(BigFloat, n, n) for _ = 1:m]
C = [rand(BigFloat, n, n) for _ = 1:m-1]
d = rand(BigFloat, m * n)
x = similar(d);
A_float64 = [Float64.(i) for i in A]
B_float64 = [Float64.(i) for i in B]
C_float64 = [Float64.(i) for i in C]
d_float64 = Float64.(d)
x_float64 = similar(d_float64)
original_result = block_thomas_tridiagonal(A, B, C, d)
result_float64 = block_thomas_tridiagonal(A_float64, B_float64, C_float64, d_float64)
error_norm = norm(original_result - result_float64, 1)
println("1-norm of the error: ", error_norm)
# original_result = block_thomas_tridiagonal!(x, A, B, C, d)
# result_float64 = block_thomas_tridiagonal!(x_float64, A_float64, B_float64, C_float64, d_float64)
# error_norm = norm(original_result - result_float64, 1)
# println("1-norm of the error: ", error_norm)
original: 1-norm of the error: 5.06330e-09
improved: 1-norm of the error: 1.1942802e-08
If you change the seed, sometimes the improved version has a smaller error.
But it does demonstrate that the block Thomas algorithm accumulates errors on the order of 1e-9 on this medium-sized example.
May I ask if there is a better algorithm for solving block tridiagonal systems? I would like to trade off accuracy and performance, and to exploit the sparsity instead of solving the entire matrix as a dense one.
If your blocks are sparse, you should just build the whole sparse matrix and use either KLU or Krylov. If your blocks are dense, Thomas will be the best (even though it is not stable, as said before).
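A hedged sketch of the "build the whole sparse matrix" route, reusing the A/B/C block naming from the first post (the blocks may be dense or sparse); the solve below uses the stock SparseArrays backslash (UMFPACK LU), and KLU.jl's klu(M) \ d would be the drop-in alternative suggested above.

using SparseArrays, LinearAlgebra

function assemble_block_tridiagonal(A, B, C)
    n, m = size(B[1], 1), length(B)
    M = spzeros(n * m, n * m)
    for i = 1:m
        rows = (i-1)*n+1 : i*n
        M[rows, rows] = B[i]              # diagonal block
        if i < m
            M[rows .+ n, rows] = A[i]     # sub-diagonal block
            M[rows, rows .+ n] = C[i]     # super-diagonal block
        end
    end
    return M
end

M = assemble_block_tridiagonal(A_float64, B_float64, C_float64)
x_sparse = M \ d_float64                  # sparse LU; klu(M) \ d_float64 would be analogous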
It depends, but as an example, for discontinuous Galerkin problems (block tridiagonal) you don’t need that much stability in the linear solve. In that case, if you use monomial bases you get dense blocks, and if you use Lobatto-Legendre you get diagonal matrices in the blocks, and I think it’s better to just make everything sparse then.
Using * instead of .* means that the multiplication is not fused with the .-= — it allocates a separate array, in a separate loop. See the “more dots” performance tip. You might want to read this article about Julia’s “dot” notation and what it does.
Also, b[n*(i-1)+1:n*i] allocates a new array, and also b[n*i+1:n*(i+1)] on the right-hand side, since you aren’t using @views. See the “consider using views” performance tip. Probably best to just put @views in front of function so that you use it everywhere in the function, since right now you are allocating lots of copies for slices. (Then you can get rid of the @view calls.)
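A hedged micro-example of both tips combined, applied to a right-hand-side update like the one in the loop above (the names n, L1, b, and b_buf are made up here).

using LinearAlgebra

n = 4
L1 = rand(n, n); b = rand(2n); b_buf = similar(b, n)

b[n+1:2n] .-= L1 * b[1:n]                     # slices copy, and L1 * ... allocates a temporary
@views b[n+1:2n] .-= mul!(b_buf, L1, b[1:n])  # views + preallocated buffer: no allocations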
I thought you didn’t want to overwrite the input array?
Why are you using a separate pre-allocated buffer D_buf[i] for every loop iteration, rather than just allocating a single buffer and re-using it?
Why not use mul! with a buffer here as well? Similarly for the U[i] * operation later?
Actually there is, but it is a bit tricky to understand. The “normal” Thomas algorithm (TA) implementation is automatically transformed into a TA applied to fixed-size packs of numbers, where the usual operations (+, -, *, /, …) are transformed (at compile time) into the corresponding SIMD operations. This generated SIMD version of the TA is then applied in parallel (threads) to a collection of different data (a pmap operation).
More details here
The main idea here is to understand that the TA is intrinsically sequential, but that it can be applied in parallel to different tridiagonal blocks of the matrix. It is pretty easy to do this using threads, but the SIMD part (the finer level of parallelism) is more difficult and requires adapting (interleaving) the data layout of the inputs (matrix and vectors).
I understand that this C++ code is not trivial to reproduce/translate in Julia, but I think it may provide a rather solid upper bound for performance (an answer to your initial question about potential performance improvements).
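As a rough illustration of the pack-of-numbers idea (not the Legolas++ code itself), here is a hedged Julia sketch using SIMD.jl's Vec type: the scalar Thomas recurrence is written once, but every coefficient is a Vec of W independent systems, so each arithmetic operation advances W solves at once.

using SIMD  # assumes the SIMD.jl package

const W = 4                    # pack width (assumption; pick to match the hardware)
const V = Vec{W,Float64}

# Scalar Thomas algorithm applied to packs: a, b, c, d each hold m packs,
# with a[1] and c[m] unused, solving W independent tridiagonal systems at once.
function thomas_packed!(x::Vector{V}, a::Vector{V}, b::Vector{V}, c::Vector{V}, d::Vector{V})
    m = length(b)
    cp = similar(c); dp = similar(d)
    cp[1] = c[1] / b[1]
    dp[1] = d[1] / b[1]
    for i = 2:m
        denom = b[i] - a[i] * cp[i-1]      # each operation acts on W systems at once
        i < m && (cp[i] = c[i] / denom)
        dp[i] = (d[i] - a[i] * dp[i-1]) / denom
    end
    x[m] = dp[m]
    for i = m-1:-1:1
        x[i] = dp[i] - cp[i] * x[i+1]
    end
    return x
end

Threading then comes on top of this, by running the packed kernel on different batches of systems, as described above.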
I know about “Fuse vectorized operations”. But here * is the matrix-vector product of the block L[i] with a slice of b, not elementwise multiplication (.*). Replacing the matrix-vector product with an elementwise product would give the wrong result.
I know this, but I had also thought that array views don’t participate in computations as quickly as array slices (which create new arrays), so I was hesitant to choose between slices and views here.
But I just tested it and you are right, it’s faster to use the view here.
Thank you very much for this suggestion, it cut my allocations in half again (from 325.692 μs, memory estimate: 114.73 KiB, allocs estimate: 893, to 308.563 μs, memory estimate: 31.06 KiB, allocs estimate: 298). Awesome!
Yes, I simply didn’t think of that, thanks for the suggestion. I’ve revised it.
Yes, I added it later and it’s an area that could be improved.
Now the code is highly optimized.
@views function block_thomas_tridiagonal!(x::AbstractVector, L::AbstractVector{T}, D::AbstractVector{T},
                                          U::AbstractVector{T}, b::AbstractVector) where {T}
    n = size(D[1], 1) # each block has size n*n
    m = length(D)     # num of blocks
    (; lu_buf, D_buf, b_buf) = get_block_thomas_buffer(n, m)
    # Forward elimination
    for i = 1:m-1
        lu_buf[i] = lu!(D[i])
        rdiv!(L[i], lu_buf[i])
        D[i+1] .-= mul!(D_buf, L[i], U[i])
        b[n*i+1:n*(i+1)] .-= mul!(b_buf, L[i], b[n*(i-1)+1:n*i])
    end
    # Backward substitution
    ldiv!(x[n*(m-1)+1:n*m], lu!(D[m]), b[n*(m-1)+1:n*m])
    for i = m-1:-1:1
        ldiv!(x[n*(i-1)+1:n*i], lu_buf[i],
              b[n*(i-1)+1:n*i] - mul!(b_buf, U[i], x[n*i+1:n*(i+1)]))
    end
    return x
end
with a buffer system:
mutable struct BlockThomas
    n::Int
    m::Int
    lu_buf::Vector{LU{Float64,Matrix{Float64},Vector{Int64}}}
    D_buf::Matrix{Float64}
    b_buf::Vector{Float64}
    function BlockThomas(n, m)
        lu_buf = Vector{LU{Float64,Matrix{Float64},Vector{Int64}}}(undef, m)
        D_buf = Matrix{Float64}(undef, n, n)
        b_buf = Vector{Float64}(undef, n)
        return new(n, m, lu_buf, D_buf, b_buf)
    end
end
const BlockThomasBuf = Vector{BlockThomas}(undef, nthreads())
function get_block_thomas_buffer(n::Int, m::Int)
    tid = threadid()
    return if !isassigned(BlockThomasBuf, tid)
        BlockThomasBuf[tid] = BlockThomas(n, m)
    elseif BlockThomasBuf[tid].n != n || BlockThomasBuf[tid].m != m
        BlockThomasBuf[tid] = BlockThomas(n, m)
    else
        BlockThomasBuf[tid]
    end
end
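Here is a hedged sketch of how the per-thread buffer might be exercised when many independent block tridiagonal systems are solved concurrently; the names solutions, Ls, Ds, Us, and bs are hypothetical arrays of systems.

using Base.Threads

function solve_many!(solutions, Ls, Ds, Us, bs)
    @threads :static for k in eachindex(bs)   # :static keeps each iteration on one thread
        block_thomas_tridiagonal!(solutions[k], Ls[k], Ds[k], Us[k], bs[k])
    end
    return solutions
end

Note that with the default :dynamic schedule a task can migrate between threads, so the threadid-indexed buffer would not be safe there; :static scheduling avoids that.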
Thank you for your detailed response. It is a bit difficult for me to understand what you are talking about, and I don’t feel that I can do the SIMD version at my level.
No worries. Like I said, it may only be useful if you really want to get an upper bound for the performance (I put a lot of work into this since it was an essential part of a large industrial software package).
A Julia translation of Legolas++ has been on my mind for a few years now (discussed this with @Elrod some time ago), but I could not find the time yet …
Your block tridiagonal solver is a good start; improve it by using LU factorization for efficiency, ensuring type stability, adding error handling, and considering SIMD/multithreading for further optimization, while maintaining clear documentation and thorough testing.