Block Diagonal Factorization Performance

I’m looking to leverage the speedup that’s possible by exploiting block diagonal structure during matrix factorization. Since I’m relying on existing code and Julia’s multiple dispatch, I need the solution to 1) use a subtype of AbstractMatrix and 2) provide new methods for existing factorization functions (e.g. cholesky). My use case is not enormous matrices, but more modest ones that need to be operated on as quickly as possible to increase throughput.

Below is a comparison of the tools I’m aware of, namely BlockDiagonals and BlockArrays. I really like the interfaces these packages provide, but I’m aware that BlockDiagonals is largely abandoned at this point. I’d love to learn about new tools as well, and to hear about improvements to what I’m currently attempting.

using BenchmarkTools
using LinearAlgebra, StaticArrays
using BlockDiagonals, BlockArrays

using Random
Random.seed!(1234)

# create a random matrix with a valid Cholesky decomposition
function randomPSD(n::Int)
    rand_mat = rand(n, n)
    # reinterpret the upper triangle as a Cholesky factor U;
    # U * U' (i.e. rand_chol.U * rand_chol.L) is then symmetric positive definite
    rand_chol = Cholesky(rand_mat, :U, 0)
    return rand_chol.U * rand_chol.L
end
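
A quick sanity check of the construction (my own addition, not part of the benchmarks):

A = randomPSD(4)
@assert isposdef(A)  # U * U' is positive definite since U is nonsingular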

# this is much faster than cat(A, B, dims=(1,2));
# I could also just stamp it out directly with a single allocation (sketched below)
function diag_concat(A::AbstractMatrix, B::AbstractMatrix)
    T = Base.promote_eltypeof(A, B)
    dimsA, dimsB = size(A), size(B)
    # zero-pad each block so A and B land on the diagonal
    return hcat(vcat(A, zeros(T, dimsB[1], dimsA[2])),
                vcat(zeros(T, dimsA[1], dimsB[2]), B))
end
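
For reference, the single-allocation version mentioned above might look something like this (a sketch assuming square blocks; diag_stamp is just an illustrative name, and it isn’t benchmarked here):

# allocate the full zero matrix once and copy each square block onto the diagonal
function diag_stamp(blocks::Vector{<:AbstractMatrix})
    T = promote_type(map(eltype, blocks)...)
    N = sum(b -> size(b, 1), blocks)
    out = zeros(T, N, N)
    offset = 0
    for b in blocks
        k = size(b, 1)
        out[offset+1:offset+k, offset+1:offset+k] = b
        offset += k
    end
    return out
end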

n = [2, 3, 1, 2]
m = [randomPSD(i) for i in n]

println("Construction:")
println("Dense")
fm = @btime reduce(diag_concat, $m);
println("BlockDiagonal")
bd = @btime BlockDiagonal($m);
println("BlockMatrix")
bb = @btime mortar(Diagonal($m));

Construction:
Dense
526.316 ns (15 allocations: 3.23 KiB)
BlockDiagonal
1.541 ns (0 allocations: 0 bytes)
BlockMatrix
678.105 ns (29 allocations: 2.89 KiB)

Construction of a BlockDiagonal is “free”, while a BlockMatrix is much heavier. Now let’s look at Cholesky performance as a factorization example. Note that I think I have to define this method myself for a Diagonal-blocked BlockMatrix.

# Cholesky of a block-diagonal BlockMatrix: factor each diagonal block and
# reassemble (note this is type piracy: I own neither the function nor the type)
function LinearAlgebra.cholesky(x::BlockMatrix{T,<:Diagonal}) where {T}
    # only each block's upper triangle is meaningful; Cholesky's U property masks the rest
    C = mortar(Diagonal([cholesky(x[Block(i, i)]).factors for i in 1:BlockArrays.blocksize(x, 1)]))
    return Cholesky(C, 'U', 0)
end
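
A quick correctness check against the dense factorization (again my own addition):

@assert Matrix(cholesky(bb).U) ≈ Matrix(cholesky(fm).U)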

println("Cholesky:")
println("Reference")
@btime cholesky.($m);
println("Dense")
@btime cholesky($fm);
println("BlockDiagonal")
@btime cholesky($bd);
println("BlockMatrix")
@btime cholesky($bb);

Cholesky:
Reference
302.657 ns (5 allocations: 544 bytes)
Dense
203.814 ns (1 allocation: 576 bytes)
BlockDiagonal
323.370 ns (9 allocations: 528 bytes)
BlockMatrix
995.800 ns (34 allocations: 3.34 KiB)

BlockDiagonal performs well, and its implementation could probably be improved further to exactly match the reference; a sketch of what that might look like follows below. BlockMatrix is again heavy because of the allocations in constructing the output object. Overall, not even the reference is able to match the direct “dumb” decomposition of the dense array, simply because performance here is driven by the allocations, not the math.
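
Something along these lines is the direction I’d expect an improvement to take (a hypothetical sketch; blockdiag_cholesky is just an illustrative name, and I haven’t benchmarked it):

# factor each block and wrap the raw factors directly, avoiding the extra
# allocations of the generic path
function blockdiag_cholesky(B::BlockDiagonal)
    Cs = [cholesky(b).factors for b in BlockDiagonals.blocks(B)]
    return Cholesky(BlockDiagonal(Cs), 'U', 0)
end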

At this point I switched to trying StaticArrays.

function diag_concat(A::SMatrix{RA,CA}, B::SMatrix{RB,CB}) where {RA,CA,RB,CB}
    T = Base.promote_eltypeof(A, B)
    # the zero padding is static too, so the result is a fully static SMatrix
    return hcat(vcat(A, zeros(SMatrix{RB,CA,T})),
                vcat(zeros(SMatrix{RA,CB,T}), B))
end

ms = [SMatrix{size(mi)...}(mi) for mi in m]

println("Static Construction:")
println("Dense")
fms = @btime reduce(diag_concat, $ms);
println("BlockDiagonal")
bds = @btime BlockDiagonal($ms);
println("BlockMatrix")
bbs = @btime mortar(Diagonal($ms));

Static Construction:
Dense
83.849 ns (3 allocations: 1.03 KiB)
BlockDiagonal
1.500 ns (0 allocations: 0 bytes)
BlockMatrix
5.077 μs (41 allocations: 3.27 KiB)

Construction of the BlockDiagonal remained free as expected, but the BlockMatrix construction got even worse - presumably due to type instability: static matrices of different sizes have different concrete types, so the container holding them cannot have a concrete eltype. This is easy to confirm:
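
# each block is a differently-sized SMatrix, hence a different concrete type,
# so the Vector's eltype is necessarily abstract
isconcretetype(eltype(ms))  # false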

println("Static Cholesky:")
println("Reference")
@btime cholesky($(ms[1])), cholesky($(ms[2])), cholesky($(ms[3])), cholesky($(ms[4]));
println("Dense")
@btime cholesky($fms);
println("BlockDiagonal")
@btime cholesky($bds);
println("BlockMatrix")
@btime cholesky($bbs);

Static Cholesky:
Reference
11.762 ns (0 allocations: 0 bytes)
Dense
44.402 ns (0 allocations: 0 bytes)
BlockDiagonal
4.744 μs (24 allocations: 1.05 KiB)
BlockMatrix
7.861 μs (63 allocations: 4.61 KiB)

Here we finally see the expected performance improvement between the reference and dense implementations - roughly 4x faster by factoring the blocks separately. Cholesky is also much faster in general using StaticArrays. Unfortunately, neither BlockDiagonal nor BlockMatrix performs well at all - both are worse than using a traditional matrix. I expect this is again because of type instabilities with multiple static matrices of different sizes; one workaround I can think of is sketched below.
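
Storing the heterogeneous blocks in a Tuple instead of a Vector preserves each block’s concrete type, so per-block operations can specialize (a sketch under that assumption; chol_blocks is just an illustrative name):

# constructing the tuple is itself not inferable from the Vector's type, so do
# it once up front; behind a function barrier, map over the tuple specializes
# per block with no dynamic dispatch
mst = Tuple(ms)
chol_blocks(t::Tuple) = map(cholesky, t)
@btime chol_blocks($mst);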

I believe this is a space where I need to be using StaticArrays to see performance improvements, but I’m not sure how to do so with existing tools. At the very least, I don’t want the wheels to come off if StaticArrays are passed in. For many other reasons, I’d really like to leverage the BlockArrays framework, but the performance I’m seeing isn’t great in this application. Looking for tips on how to make this work without reinventing another block diagonal type.