Block Tridiagonal Linear System Performance: Custom vs Package Implementations
I recently submitted an issue to BlockBandedMatrices.jl about this, but thought it might be interesting to the broader community.
Problem
I have symmetric block tridiagonal linear systems that should theoretically solve in O(n·k³) time, where n is the number of blocks and k is the block dimension (matching the notation in the code below). However, when using BlockBandedMatrices.jl, I found that SparseArrays was actually faster. This suggests either:
- I’m doing something wrong, or
- The block structure isn’t being properly exploited
I implemented custom algorithms based on Golub & Van Loan’s Matrix Computations and found them to be much faster than either package-based approach.
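For reference, the recurrence behind the custom Cholesky (Method 1 below), written in the notation of the code, with Bs the diagonal blocks and Cs the sub-diagonal blocks:
Ls[1] = chol(Bs[1])
Loffs[i-1] = Cs[i-1] * Ls[i-1]⁻ᵗ
Ls[i] = chol(Bs[i] - Loffs[i-1] * Loffs[i-1]ᵗ)
Each step is one k×k Cholesky, one triangular solve, and one k×k matrix product, i.e. O(k³), so the full factorization (and the two block substitution sweeps) is O(n·k³).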
Here is an MWE, which I will also include as a Pluto notebook.
Setup
using Random, LinearAlgebra, BlockBandedMatrices, SparseArrays
using BlockArrays: getblock
using BenchmarkTools
# Parameters
k = 2 # block size
n = 5 # number of blocks
# Create SPD blocks
function make_spd_block(k)
A = randn(k, k)
return A'A + k*I # A'A is positive semidefinite; the k*I shift makes it safely positive definite
end
Bs = [make_spd_block(k) for _ in 1:n] # diagonal blocks
Cs = [randn(k, k) for _ in 1:(n-1)] # off-diagonal blocks
# Build full matrix A
function build_full_block_tridiag(Bs, Cs)
k, n = size(Bs[1], 1), length(Bs)
A = zeros(k*n, k*n)
for i in 1:n
A[(k*(i-1)+1):(k*i), (k*(i-1)+1):(k*i)] .= Bs[i]
if i < n
A[(k*(i-1)+1):(k*i), (k*i+1):(k*(i+1))] .= Cs[i]'
A[(k*i+1):(k*(i+1)), (k*(i-1)+1):(k*i)] .= Cs[i]
end
end
return Symmetric(A)
end
A_dense = build_full_block_tridiag(Bs, Cs)
b = randn(k * n)
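Before comparing solvers, it is worth a quick sanity check that the assembled matrix really is symmetric positive definite, since Method 1 silently assumes it:
@assert issymmetric(A_dense)
@assert isposdef(A_dense) # SPD thanks to the k*I shift in make_spd_block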
Method 1: Custom Block Cholesky (brittle but fast)
function custom_block_cholesky(Bs, Cs, b)
n = length(Bs)
k = size(Bs[1], 1)
Ls = Vector{LowerTriangular{Float64, Matrix{Float64}}}(undef, n) # keep the triangular type so later \ and / calls hit triangular solves
Loffs = Vector{Matrix{Float64}}(undef, n - 1)
# Cholesky factorization
Ls[1] = cholesky(Symmetric(Bs[1])).L
for i in 2:n
# Solve Loffs[i-1] * Lᵢ₋₁ᵗ = Cᵢ₋₁ ⇒ Loffs[i-1] = Cᵢ₋₁ * Lᵢ₋₁⁻ᵗ
Loffs[i-1] = Cs[i-1] / Ls[i-1]' # triangular solve; avoids forming an explicit inverse
S = Bs[i] - Loffs[i-1] * Loffs[i-1]' # Schur complement of the previous block
Ls[i] = cholesky(Symmetric(S)).L # Symmetric() guards against round-off asymmetry
end
# Forward substitution
y = zeros(k * n)
for i in 1:n
idx = (k*(i-1)+1):(k*i)
y[idx] .= b[idx]
if i > 1
prev_idx = (k*(i-2)+1):(k*(i-1))
y[idx] .-= Loffs[i-1] * y[prev_idx]
end
y[idx] .= Ls[i] \ y[idx]
end
# Backward substitution
x = zeros(k * n)
for i in n:-1:1
idx = (k*(i-1)+1):(k*i)
x[idx] .= y[idx]
if i < n
next_idx = (k*i+1):(k*(i+1))
x[idx] .-= Loffs[i]' * x[next_idx]
end
x[idx] .= Ls[i]' \ x[idx]
end
return x
end
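A quick correctness check against the assembled dense matrix (using the data from the setup above; the residual should be near machine precision):
x_chol = custom_block_cholesky(Bs, Cs, b)
norm(A_dense * x_chol - b) # should be tiny relative to norm(b)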
Method 2: Custom Block LU (more robust)
function custom_block_lu(As, Bs, Cs, b)
n = length(As) # number of block rows
k = size(As[1], 1) # block size
# Storage for LU factors
Us = Vector{Matrix{Float64}}(undef, n) # Upper diagonal blocks
Ls = Vector{Matrix{Float64}}(undef, n-1) # Lower off-diagonal blocks
# First block: U₁ = A₁
Us[1] = copy(As[1])
for i = 1:n-1
# Compute lower off-diagonal: Lᵢ = Bᵢ * Uᵢ⁻¹
Ls[i] = Bs[i] / Us[i]
# Update next diagonal block: Uᵢ₊₁ = Aᵢ₊₁ - Lᵢ * Cᵢ
Us[i+1] = As[i+1] - Ls[i] * Cs[i]
end
# Forward substitution: the unit block lower factor has identity diagonal blocks, so no solve is needed
y = zeros(k * n)
for i = 1:n
idx = (k*(i-1)+1):(k*i)
y[idx] = b[idx]
# Subtract contribution from previous block
if i > 1
prev_idx = (k*(i-2)+1):(k*(i-1))
y[idx] -= Ls[i-1] * y[prev_idx]
end
end
# Backward substitution: U has Us on the diagonal and the original Cs on the super-diagonal
x = zeros(k * n)
for i = n:-1:1
idx = (k*(i-1)+1):(k*i)
x[idx] = y[idx]
# Subtract contribution from next block
if i < n
next_idx = (k*i+1):(k*(i+1))
x[idx] -= Cs[i] * x[next_idx]
end
# Solve with diagonal block
x[idx] = Us[i] \ x[idx]
end
return x
end
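Because the function parameters (As, Bs, Cs) shadow the script-level Bs and Cs with different meanings, here is how a call maps onto the symmetric system built above (the super-diagonal blocks are the transposes of the sub-diagonal ones):
# diagonal blocks = Bs, sub-diagonal = Cs, super-diagonal = adjoints of Cs
x_lu = custom_block_lu(Bs, Cs, [c' for c in Cs], b)
Note that there is no pivoting within or between blocks; that is safe for SPD systems like this one, but something to keep in mind for general block tridiagonal matrices.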
Method 3: BlockBandedMatrices.jl
block_sizes = fill(k, n)
A_bbm = BlockBandedMatrix{Float64}(undef, block_sizes, block_sizes, (1, 1))
for i in 1:n
getblock(A_bbm, i, i) .= Bs[i]
end
for i in 1:(n-1)
getblock(A_bbm, i+1, i) .= Cs[i]
getblock(A_bbm, i, i+1) .= Cs[i]'
end
x_bbm = A_bbm \ b
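Depending on your BlockArrays version, getblock may be deprecated; an equivalent construction (a sketch, assuming a recent BlockArrays with Block indexing) is:
using BlockArrays: Block
for i in 1:n
view(A_bbm, Block(i, i)) .= Bs[i]
end
for i in 1:(n-1)
view(A_bbm, Block(i+1, i)) .= Cs[i]
view(A_bbm, Block(i, i+1)) .= Cs[i]'
end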
Method 4: SparseArrays.jl
A_sparse = sparse(A_dense)
x_sparse = A_sparse \ b
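One caveat for this baseline: depending on the Julia version, the generic sparse \ may detect symmetry and route to CHOLMOD’s Cholesky, or fall back to UMFPACK’s general LU. To make sure the SPD structure is exploited, you can factor explicitly (a sketch using the SuiteSparse-backed methods that ship with SparseArrays):
F_chol = cholesky(Symmetric(A_sparse)) # CHOLMOD sparse Cholesky
x_sparse_chol = F_chol \ b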
Results
# Method 1: Custom Block Cholesky
x_custom = custom_block_cholesky(Bs, Cs, b)
# Method 2: Custom Block LU
c_prime = [cs' for cs in Cs] # super-diagonal blocks are the transposes of the sub-diagonal blocks
x_custom_lu = custom_block_lu(Bs, Cs, c_prime, b)
# Verify the solutions agree (differences should be near machine precision)
println("Custom Cholesky vs BlockBanded: ", norm(x_custom - x_bbm))
println("Custom Cholesky vs Sparse: ", norm(x_custom - x_sparse))
println("Sparse vs BlockBanded: ", norm(x_sparse - x_bbm))
println("Custom Cholesky vs Custom LU: ", norm(x_custom_lu - x_custom))
# Performance comparison
@btime custom_block_cholesky($Bs, $Cs, $b)
@btime $A_bbm \ $b
@btime $A_sparse \ $b
@btime custom_block_lu($Bs, $Cs, $c_prime, $b)
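It is also informative to time factorization and solve separately, since many applications reuse one factorization across many right-hand sides (reusing F_chol from the sparse Cholesky sketch above):
@btime cholesky(Symmetric($A_sparse)) # factorization cost
@btime $F_chol \ $b # per-solve cost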
Questions
- Am I using BlockBandedMatrices.jl correctly? Is there a better way to construct or solve with these matrices?
- Why is SparseArrays faster than BlockBandedMatrices? Shouldn’t the block structure provide advantages?
Any insights or suggestions would be greatly appreciated!