I did some searching but it seems that Julia doesn’t implement a specific algorithm for block tridiagonal matrices. So I implemented one myself. I would like to know:
Have I missed some existing algorithm implementations for block tridiagonal matrices?
Is there any room for improvement in the solver function I wrote?
function block_thomas_tridiagonal(A::AbstractVector{T}, B::AbstractVector{T}, C::AbstractVector{T},
                                  d::AbstractVector) where {T}
    B = copy(B)
    n = size(B[1], 1) # rows of each block
    m = length(B)     # number of blocks
    # Forward elimination
    for i = 2:m # (no speed up using inbounds+simd)
        multiplier = A[i-1] * inv(B[i-1])
        B[i] -= multiplier * C[i-1]
        d[n*(i-1)+1:n*i] .-= multiplier * d[n*(i-2)+1:n*(i-1)]
    end
    # Backward substitution
    x = similar(d)
    x[n*(m-1)+1:n*m] = B[m] \ d[n*(m-1)+1:n*m] # no speed up using .=
    for i = m-1:-1:1
        x[n*(i-1)+1:n*i] = B[i] \ (d[n*(i-1)+1:n*i] - C[i] * x[n*i+1:n*(i+1)])
    end
    return x
end
This function is specifically designed to handle the case where every block has the same size. It usually takes a Vector{Matrix{Float64}} as input and solves each block one by one.
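For readers who want to try it, here is a minimal usage sketch with made-up sizes; the diagonal blocks are shifted by a multiple of the identity only to keep this random example well conditioned, and copy(d) is passed because the function updates the right-hand side in place.

using LinearAlgebra

n, m = 4, 6                                     # block size and number of blocks (arbitrary)
A = [rand(n, n) for _ = 1:m-1]                  # sub-diagonal blocks
B = [rand(n, n) + 2n * I for _ = 1:m]           # diagonal blocks, shifted to stay well conditioned
C = [rand(n, n) for _ = 1:m-1]                  # super-diagonal blocks
d = rand(n * m)
x = block_thomas_tridiagonal(A, B, C, copy(d))  # copy(d) because d is modified in place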
This parallel (thread+SIMD) C++ implementation may help you write a performant Julia implementation of the Thomas algorithm. Be aware that for large problems exceeding the cache size, the algorithm is memory bound.
Don’t use explicit matrix inversion. You could use A[i-1] / B[i-1] instead. If the blocks are all the same size, as you seem to be assuming here, you could preallocate buffers for the LU factorization, the multipliers, and the other matrix multiplications.
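To make that suggestion concrete, here is a small hedged sketch, with hypothetical stand-ins Ai and Bi for A[i-1] and B[i-1], of computing the multiplier without an explicit inverse, either via right division or via a reusable LU factorization.

using LinearAlgebra

Ai, Bi = rand(4, 4), rand(4, 4) + 4I   # stand-ins for A[i-1] and B[i-1]
multiplier = Ai / Bi                   # right division: no explicit inverse is formed
# or, reusing preallocated storage across iterations:
F = lu!(copy(Bi))                      # factor the diagonal block once
rdiv!(Ai, F)                           # overwrite Ai with Ai / Bi in place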
That’s because .= is for elementwise/broadcasting operations, not matrix operations. If you want to do the latter in-place, use ldiv! and mul! etc.
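A tiny hedged illustration of that point: Y .= K * X still allocates the product before broadcasting it into Y, whereas mul! and ldiv! write the result directly into a preallocated output.

using LinearAlgebra

K = rand(3, 3); X = rand(3, 3); Y = similar(X)
mul!(Y, K, X)        # Y = K * X with no temporary array
F = lu(K)
ldiv!(Y, F, X)       # Y = K \ X, reusing the factorization F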
Beware that this algorithm may be numerically unstable, IIRC, e.g. if any of the blocks are nearly singular.
It is, unfortunately. It’s even unstable for a non-block symmetric tridiagonal matrix. (And the block case is going to be worse with the explicit inverse you pointed out.) You really need some form of pivoting to guarantee stability. I put an example for the factorization variant used in ldlt in a post last year.
I just showed the factorization backward error, but the thing that seems more troubling to me is that the symmetric tridiagonal solver in Julia uses an unstable algorithm. With a relative residual for the solver, the example is:
LAPACK avoids this by using LU with partial pivoting for its tridiagonal solver. However, in addition to the permutation, you pick up an extra superdiagonal in U, so it isn’t a fully in-place algorithm like the elegant algorithm implemented in ldlt!. The alternative is Bunch’s method, which uses 2×2 diagonal blocks as mentioned in my other post. You don’t need any permutations, but you do need to accept 2×2 blocks in D. Bunch’s algorithm has the further advantage of preserving symmetry, so by finding the eigenvalues in D you can reliably compute the inertia of a matrix. However, it also requires extra storage, and you can’t represent the triangular factor as a UnitLowerTriangular and D as a SymTridiagonal as is done now. The structure of the factors doesn’t align as nicely with existing types in LinearAlgebra, which is annoying even if you are fine giving up on a fully in-place factorization.
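For reference, here is one hedged way to compute the kind of relative residual mentioned above (this is my formulation; the exact definition used in the earlier post may differ slightly).

using LinearAlgebra

relative_residual(M, x, d) = norm(d - M * x) / (norm(M) * norm(x))

M = rand(8, 8) + 8I; d = rand(8)
relative_residual(M, M \ d, d)   # a backward-stable solve gives a value near machine epsilon here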
Thanks for the link. But I took a closer look and there is no implementation of Thomas’ algorithm for block tridiagonal matrices there, only the ordinary Thomas algorithm. It did help me improve the Thomas algorithm I originally wrote, though, thank you.
Thank you for the reply. I spent a little time on how to improve this algorithm based on your suggestion. Here is a comparison between the original and the improved version
original
function block_thomas_tridiagonal(L::AbstractVector{T}, D::AbstractVector{T}, U::AbstractVector{T},
                                  b::AbstractVector) where {T}
    D = deepcopy(D)
    n = size(D[1], 1) # each block has size n*n
    m = length(D)     # num of blocks
    # Forward elimination
    for i = 1:m-1
        multiplier = L[i] * inv(D[i])
        D[i+1] -= multiplier * U[i]
        b[n*i+1:n*(i+1)] .-= multiplier * b[n*(i-1)+1:n*i]
    end
    # Backward substitution
    x = similar(b)
    x[n*(m-1)+1:n*m] = D[m] \ b[n*(m-1)+1:n*m]
    for i = m-1:-1:1
        x[n*(i-1)+1:n*i] = D[i] \ (b[n*(i-1)+1:n*i] - U[i] * x[n*i+1:n*(i+1)])
    end
    return x
end
and improved
function block_thomas_tridiagonal!(x::AbstractVector, L::AbstractVector{T}, D::AbstractVector{T},
                                   U::AbstractVector{T}, b::AbstractVector) where {T}
    n = size(D[1], 1) # each block has size n*n
    m = length(D)     # num of blocks
    (; lu_buf, D_buf) = get_block_thomas_buffer(n, m)
    # Forward elimination
    for i = 1:m-1
        lu_buf[i] = lu!(D[i])
        rdiv!(L[i], lu_buf[i])
        D[i+1] .-= mul!(D_buf[i], L[i], U[i])
        b[n*i+1:n*(i+1)] .-= L[i] * b[n*(i-1)+1:n*i]
    end
    # Backward substitution
    ldiv!(@view(x[n*(m-1)+1:n*m]), lu!(D[m]), b[n*(m-1)+1:n*m])
    for i = m-1:-1:1
        ldiv!(@view(x[n*(i-1)+1:n*i]), lu_buf[i], (b[n*(i-1)+1:n*i] - U[i] * x[n*i+1:n*(i+1)]))
    end
    return x
end
where the buffer is a vector of structures I wrote myself
mutable struct BlockThomas
    n::Int
    m::Int
    lu_buf::Vector{LU{Float64,Matrix{Float64},Vector{Int64}}}
    D_buf::Vector{Matrix{Float64}}
    function BlockThomas(n, m)
        lu_buf = Vector{LU{Float64,Matrix{Float64},Vector{Int64}}}(undef, m)
        D_buf = [Matrix{Float64}(undef, n, n) for _ = 1:m]
        return new(n, m, lu_buf, D_buf)
    end
end
const BlockThomasBuf = Vector{BlockThomas}(undef, nthreads())
function get_block_thomas_buffer(n::Int, m::Int)
    tid = threadid()
    return if !isassigned(BlockThomasBuf, tid)
        BlockThomasBuf[tid] = BlockThomas(n, m)
    elseif BlockThomasBuf[tid].n != n || BlockThomasBuf[tid].m != m
        BlockThomasBuf[tid] = BlockThomas(n, m)
    else
        BlockThomasBuf[tid]
    end
end
The average benchmark time went from 460.868 μs (memory estimate: 692.58 KiB, allocs estimate: 1893) to 325.692 μs (memory estimate: 114.73 KiB, allocs estimate: 893), a 29% time savings on this medium-sized example.
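As a hedged sketch of how such timings might be reproduced (the benchmarking code itself is not shown above): since the in-place solver overwrites D and b, each sample needs fresh copies, e.g. via BenchmarkTools with a setup block and evals = 1. The names L_blocks, D_blocks, U_blocks, and rhs below are hypothetical placeholders for the block arrays and right-hand side.

using BenchmarkTools

@benchmark block_thomas_tridiagonal!(x, Lc, Dc, Uc, bc) setup = (
        Lc = deepcopy(L_blocks); Dc = deepcopy(D_blocks); Uc = deepcopy(U_blocks);
        bc = copy(rhs); x = similar(rhs)) evals = 1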
For the benefit of anyone who sees this post later, I’ve put both pieces of code in my repository.
The improved version does not explicitly use matrix inversion, but the two seem to be about the same in accuracy.
using LinearAlgebra, Random

Random.seed!(1)
n = 10
m = 100
A = [rand(BigFloat, n, n) for _ = 1:m-1]
B = [rand(BigFloat, n, n) for _ = 1:m]
C = [rand(BigFloat, n, n) for _ = 1:m-1]
d = rand(BigFloat, m * n)
x = similar(d);
A_float64 = [Float64.(i) for i in A]
B_float64 = [Float64.(i) for i in B]
C_float64 = [Float64.(i) for i in C]
d_float64 = Float64.(d)
x_float64 = similar(d_float64)
original_result = block_thomas_tridiagonal(A, B, C, d)
result_float64 = block_thomas_tridiagonal(A_float64, B_float64, C_float64, d_float64)
error_norm = norm(original_result - result_float64, 1)
println("1-norm of the error: ", error_norm)
# original_result = block_thomas_tridiagonal!(x, A, B, C, d)
# result_float64 = block_thomas_tridiagonal!(x_float64, A_float64, B_float64, C_float64, d_float64)
# error_norm = norm(original_result - result_float64, 1)
# println("1-norm of the error: ", error_norm)
original: 1-norm of the error: 5.06330e-09
improved: 1-norm of the error: 1.1942802e-08
If you change the seed, sometimes the improved version has a smaller error.
But it does demonstrate that the block Thomas algorithm accumulates errors on the order of 1e-9 on this medium-sized example.
May I ask if there is a better algorithm for solving block tridiagonal systems? I would like to trade off accuracy and performance, and to exploit the sparsity instead of solving the entire matrix as a dense one.
If your blocks are sparse, you should just build the whole sparse matrix and use either KLU or Krylov. If your blocks are dense, Thomas will be the best (even though it is not stable, as said before).
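A hedged sketch of the "build the whole sparse matrix" route, reusing the A/B/C block naming from the first post (the blocks may be dense or sparse); the solve below uses the stock SparseArrays backslash (UMFPACK LU), and KLU.jl's klu(M) \ d would be the drop-in alternative suggested above.

using SparseArrays, LinearAlgebra

function assemble_block_tridiagonal(A, B, C)
    n, m = size(B[1], 1), length(B)
    M = spzeros(n * m, n * m)
    for i = 1:m
        rows = (i-1)*n+1 : i*n
        M[rows, rows] = B[i]              # diagonal block
        if i < m
            M[rows .+ n, rows] = A[i]     # sub-diagonal block
            M[rows, rows .+ n] = C[i]     # super-diagonal block
        end
    end
    return M
end

M = assemble_block_tridiagonal(A_float64, B_float64, C_float64)
x_sparse = M \ d_float64                  # sparse LU; klu(M) \ d_float64 would be analogous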
It depends, but as an example, for discontinuous Galerkin problems (block tridiagonal) you don’t need that much stability in the linear solve. In that case, if you use monomial bases you get dense blocks, and if you use Lobatto-Legendre you get diagonal matrices in the blocks, and I think it’s better to just make everything sparse then.
Using * instead of .* means that the multiplication is not fused with the .-= — it allocates a separate array, in a separate loop. See the “more dots” performance tip. You might want to read this article about Julia’s “dot” notation and what it does.
Also, b[n*(i-1)+1:n*i] allocates a new array, and also b[n*i+1:n*(i+1)] on the right-hand side, since you aren’t using @views. See the “consider using views” performance tip. Probably best to just put @views in front of function so that you use it everywhere in the function, since right now you are allocating lots of copies for slices. (Then you can get rid of the @view calls.)
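A hedged micro-example of both tips combined, applied to a right-hand-side update like the one in the loop above (the names n, L1, b, and b_buf are made up here).

using LinearAlgebra

n = 4
L1 = rand(n, n); b = rand(2n); b_buf = similar(b, n)

b[n+1:2n] .-= L1 * b[1:n]                     # slices copy, and L1 * ... allocates a temporary
@views b[n+1:2n] .-= mul!(b_buf, L1, b[1:n])  # views + preallocated buffer: no allocations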
I thought you didn’t want to overwrite the input array?
Why are you using a separate pre-allocated buffer D_buf[i] for every loop iteration, rather than just allocating a single buffer and re-using it?
Why not use mul! with a buffer here as well? Similarly for the U[i] * operation later?
Actually there is, but it is a bit tricky to understand. The “normal” Thomas algorithm (TA) implementation is automatically transformed into a TA applied to fixed-size packs of numbers, where the usual operations (+, -, *, /, …) are transformed (at compile time) into the corresponding SIMD operations. This generated SIMD version of the TA is then applied in parallel (threads) to a collection of different data (a pmap operation).
More details here
The main idea here is to understand that the TA is intrinsically sequential, but that it can be applied in parallel to different tridiagonal blocks of the matrix. It is pretty easy to do this using threads, but the SIMD part (the finer level of parallelism) is more difficult and requires adapting (interleaving) the data layout of the inputs (matrix and vectors).
I understand that this C++ code is not trivial to reproduce/translate in Julia, but I think it may provide a rather solid upper bound for performance (an answer to your initial question about potential performance improvements).
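As a rough illustration of the pack-of-numbers idea (not the Legolas++ code itself), here is a hedged Julia sketch using SIMD.jl's Vec type: the scalar Thomas recurrence is written once, but every coefficient is a Vec of W independent systems, so each arithmetic operation advances W solves at once.

using SIMD  # assumes the SIMD.jl package

const W = 4                    # pack width (assumption; pick to match the hardware)
const V = Vec{W,Float64}

# Scalar Thomas algorithm applied to packs: a, b, c, d each hold m packs,
# with a[1] and c[m] unused, solving W independent tridiagonal systems at once.
function thomas_packed!(x::Vector{V}, a::Vector{V}, b::Vector{V}, c::Vector{V}, d::Vector{V})
    m = length(b)
    cp = similar(c); dp = similar(d)
    cp[1] = c[1] / b[1]
    dp[1] = d[1] / b[1]
    for i = 2:m
        denom = b[i] - a[i] * cp[i-1]      # each operation acts on W systems at once
        i < m && (cp[i] = c[i] / denom)
        dp[i] = (d[i] - a[i] * dp[i-1]) / denom
    end
    x[m] = dp[m]
    for i = m-1:-1:1
        x[i] = dp[i] - cp[i] * x[i+1]
    end
    return x
end

Threading then comes on top of this, by running the packed kernel on different batches of systems, as described above.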
I know about “Fuse vectorized operations”. But here * is the matrix-vector product of the block L[i] with a slice of b, not elementwise multiplication (.*). Replacing the matrix-vector product with an elementwise product would give the wrong result.
I know this, but I had also thought that array views don’t participate in computations as quickly as array slices (which create new arrays), so I was hesitant to choose between slices and views here.
But I just tested it and you are right, it’s faster to use the view here.
Thank you very much for this suggestion, it cut my allocations in half again (from 325.692 μs, memory estimate: 114.73 KiB, allocs estimate: 893, to 308.563 μs, memory estimate: 31.06 KiB, allocs estimate: 298). Awesome!
Yes, I simply didn’t think of that, thanks for the suggestion. I’ve revised it.
Yes, I added it later and it’s an area that could be improved.
Now the code is highly optimized.
@views function block_thomas_tridiagonal!(x::AbstractVector, L::AbstractVector{T}, D::AbstractVector{T},
                                          U::AbstractVector{T}, b::AbstractVector) where {T}
    n = size(D[1], 1) # each block has size n*n
    m = length(D)     # num of blocks
    (; lu_buf, D_buf, b_buf) = get_block_thomas_buffer(n, m)
    # Forward elimination
    for i = 1:m-1
        lu_buf[i] = lu!(D[i])
        rdiv!(L[i], lu_buf[i])
        D[i+1] .-= mul!(D_buf, L[i], U[i])
        b[n*i+1:n*(i+1)] .-= mul!(b_buf, L[i], b[n*(i-1)+1:n*i])
    end
    # Backward substitution
    ldiv!(x[n*(m-1)+1:n*m], lu!(D[m]), b[n*(m-1)+1:n*m])
    for i = m-1:-1:1
        ldiv!(x[n*(i-1)+1:n*i], lu_buf[i],
              b[n*(i-1)+1:n*i] - mul!(b_buf, U[i], x[n*i+1:n*(i+1)]))
    end
    return x
end
with a buffer system:
mutable struct BlockThomas
    n::Int
    m::Int
    lu_buf::Vector{LU{Float64,Matrix{Float64},Vector{Int64}}}
    D_buf::Matrix{Float64}
    b_buf::Vector{Float64}
    function BlockThomas(n, m)
        lu_buf = Vector{LU{Float64,Matrix{Float64},Vector{Int64}}}(undef, m)
        D_buf = Matrix{Float64}(undef, n, n)
        b_buf = Vector{Float64}(undef, n)
        return new(n, m, lu_buf, D_buf, b_buf)
    end
end
const BlockThomasBuf = Vector{BlockThomas}(undef, nthreads())
function get_block_thomas_buffer(n::Int, m::Int)
    tid = threadid()
    return if !isassigned(BlockThomasBuf, tid)
        BlockThomasBuf[tid] = BlockThomas(n, m)
    elseif BlockThomasBuf[tid].n != n || BlockThomasBuf[tid].m != m
        BlockThomasBuf[tid] = BlockThomas(n, m)
    else
        BlockThomasBuf[tid]
    end
end
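Here is a hedged sketch of how the per-thread buffer might be exercised when many independent block tridiagonal systems are solved concurrently; the names solutions, Ls, Ds, Us, and bs are hypothetical arrays of systems.

using Base.Threads

function solve_many!(solutions, Ls, Ds, Us, bs)
    @threads :static for k in eachindex(bs)   # :static keeps each iteration on one thread
        block_thomas_tridiagonal!(solutions[k], Ls[k], Ds[k], Us[k], bs[k])
    end
    return solutions
end

Note that with the default :dynamic schedule a task can migrate between threads, so the threadid-indexed buffer would not be safe there; :static scheduling avoids that.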
Thank you for your detailed response. It is a bit difficult for me to understand what you are talking about, and I don’t feel that I can do the SIMD version at my level.
No worries. Like I said, it may only be useful if you really want to get an upper bound for the performance (I put a lot of work into this since it was an essential part of a large industrial software package).
A Julia translation of Legolas++ has been on my mind for a few years now (discussed this with @Elrod some time ago), but I could not find the time yet …
Your block tridiagonal solver is a good start; improve it by using LU factorization for efficiency, ensuring type stability, adding error handling, and considering SIMD/multithreading for further optimization, while maintaining clear documentation and thorough testing.