I’m looking to leverage the speedup that is possible by exploiting block diagonal structure during matrix factorization. Since I’m relying on existing code and Julia’s multiple dispatch, I need the solution to 1) be a subtype of `AbstractMatrix` and 2) provide new methods for existing factorization functions (e.g. `cholesky`). My use case is not enormous matrices, but more modest ones that need to be operated on as quickly as possible to maximize throughput.
Below is a comparison of the tools I’m aware of, namely `BlockDiagonals` and `BlockArrays`. I really like the interface these use, but I’m aware that `BlockDiagonals` is largely abandoned at this point. I’d love to learn about new tools as well and hear about improvements to what I’m currently attempting.
```julia
using BenchmarkTools
using LinearAlgebra, StaticArrays
using BlockDiagonals, BlockArrays
using Random
Random.seed!(1234)

# create a matrix with a valid cholesky decomposition
function randomPSD(n::Int)
    rand_mat = rand(n, n)
    rand_chol = Cholesky(rand_mat, :U, 0)
    return rand_chol.U * rand_chol.L  # U * U' is symmetric positive definite
end

# this is much faster than cat(A, B, dims=(1,2));
# obviously I could just stamp it out directly with only a single
# allocation (sketched below)
function diag_concat(A::AbstractMatrix, B::AbstractMatrix)
    T = Base.promote_eltypeof(A, B)
    dimsA, dimsB = size(A), size(B)
    return hcat(vcat(A, zeros(T, dimsB[1], dimsA[2])),
                vcat(zeros(T, dimsA[1], dimsB[2]), B))
end
```
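Here’s roughly what that single-allocation version could look like (an untested sketch; `diag_concat_direct` is just an illustrative name):

```julia
# sketch of the "stamp it out directly" version: one zero-initialized
# dense matrix, with each block copied into place
function diag_concat_direct(blocks::AbstractVector{<:AbstractMatrix})
    T = promote_type(eltype.(blocks)...)
    out = zeros(T, sum(size.(blocks, 1)), sum(size.(blocks, 2)))
    r = c = 0
    for B in blocks
        out[r+1:r+size(B, 1), c+1:c+size(B, 2)] .= B
        r += size(B, 1)
        c += size(B, 2)
    end
    return out
end
```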
```julia
n = [2, 3, 1, 2]
m = [randomPSD(i) for i in n]

println("Construction:")
println("Dense")
fm = @btime reduce(diag_concat, $m);
println("BlockDiagonal")
bd = @btime BlockDiagonal($m);
println("BlockMatrix")
bb = @btime mortar(Diagonal($m));
```
```
Construction:
Dense
  526.316 ns (15 allocations: 3.23 KiB)
BlockDiagonal
  1.541 ns (0 allocations: 0 bytes)
BlockMatrix
  678.105 ns (29 allocations: 2.89 KiB)
```
Construction of a `BlockDiagonal` is “free” while a `BlockMatrix` is much heavier. Now let’s look at `cholesky` performance as a factorization example. Note that I think I have to define this myself for a `BlockMatrix` backed by a `Diagonal` of blocks.
```julia
# cholesky for a BlockMatrix whose block storage is Diagonal:
# factor each diagonal block and collect the factors
function LinearAlgebra.cholesky(x::BlockMatrix{T,<:Diagonal}) where {T}
    C = mortar(Diagonal([cholesky(x[Block(i, i)]).factors
                         for i in 1:BlockArrays.blocksize(x, 1)]))
    return Cholesky(C, 'U', 0)
end
```
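A quick sanity check, since the Cholesky factor of a positive definite matrix is unique:

```julia
# the block-wise upper factor should match the dense one up to roundoff
@assert cholesky(bb).U ≈ cholesky(fm).U
```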
println("Cholesky:")
println("Reference")
@btime cholesky.($m);
println("Dense")
@btime cholesky($fm);
println("BlockDiagonal")
@btime cholesky($bd);
println("BlockMatrix")
@btime cholesky($bb);
```
Cholesky:
Reference
  302.657 ns (5 allocations: 544 bytes)
Dense
  203.814 ns (1 allocation: 576 bytes)
BlockDiagonal
  323.370 ns (9 allocations: 528 bytes)
BlockMatrix
  995.800 ns (34 allocations: 3.34 KiB)
```
`BlockDiagonal` performs well, and the implementation could probably be improved further to exactly match the reference. `BlockMatrix` is again heavy because of the allocations in the construction of the output object. Overall, not even the reference manages to match the direct “dumb” decomposition of the dense array, simply because performance is driven by the allocations, not the math.
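Something like the following might close that gap (a sketch only; `blockwise_cholesky` is a hypothetical name, and I haven’t benchmarked it against the package’s own method):

```julia
# sketch: factor each block directly and rewrap as a BlockDiagonal
function blockwise_cholesky(B::BlockDiagonal)
    factors = BlockDiagonal([cholesky(b).factors for b in BlockDiagonals.blocks(B)])
    return Cholesky(factors, 'U', 0)
end
```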
At this point I switched to trying `StaticArrays`.
```julia
function diag_concat(A::SMatrix{RA,CA}, B::SMatrix{RB,CB}) where {RA,CA,RB,CB}
    T = Base.promote_eltypeof(A, B)
    return hcat(vcat(A, zeros(SMatrix{RB,CA,T})),
                vcat(zeros(SMatrix{RA,CB,T}), B))
end
```
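As a quick check, the result of this version really does stay an `SMatrix` with statically known size:

```julia
# a static 2×2 and 3×3 combine into a static 5×5
diag_concat(@SMatrix(rand(2, 2)), @SMatrix(rand(3, 3))) isa SMatrix{5,5,Float64}  # true
```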
```julia
ms = [SMatrix{size(mi)...}(mi) for mi in m]

println("Static Construction:")
println("Dense")
fms = @btime reduce(diag_concat, $ms);
println("BlockDiagonal")
bds = @btime BlockDiagonal($ms);
println("BlockMatrix")
bbs = @btime mortar(Diagonal($ms));
```
```
Static Construction:
Dense
  83.849 ns (3 allocations: 1.03 KiB)
BlockDiagonal
  1.500 ns (0 allocations: 0 bytes)
BlockMatrix
  5.077 μs (41 allocations: 3.27 KiB)
```
Construction of the `BlockDiagonal` remained free as expected, but the `BlockMatrix` construction got even worse, presumably due to type instabilities with static matrices of different sizes.
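The instability is easy to see: the blocks have different concrete `SMatrix` types, so the containers can only offer an abstract element type.

```julia
# blocks of different static sizes force an abstract eltype, so every
# block access is dynamically dispatched
isconcretetype(eltype(ms))            # false
isconcretetype(eltype(Diagonal(ms)))  # false as well
```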
println("Static Cholesky:")
println("Reference")
@btime cholesky($(ms[1])), cholesky($(ms[2])), cholesky($(ms[3])), cholesky($(ms[4]));
println("Dense")
@btime cholesky($fms);
println("BlockDiagonal")
@btime cholesky($bds);
println("BlockMatrix")
@btime cholesky($bbs);
```
Static Cholesky:
Reference
  11.762 ns (0 allocations: 0 bytes)
Dense
  44.402 ns (0 allocations: 0 bytes)
BlockDiagonal
  4.744 μs (24 allocations: 1.05 KiB)
BlockMatrix
  7.861 μs (63 allocations: 4.61 KiB)
```
Here we finally see the expected performance improvement between the reference and dense implementations: factoring the blocks separately is about 4x faster. `cholesky` is also much faster in general with `StaticArrays`. Unfortunately, neither `BlockDiagonal` nor `BlockMatrix` performs well at all; both are worse than using a traditional matrix. I expect this is again because of type instabilities with multiple static matrices of different sizes.
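The only workaround I can see along these lines is to hold the blocks in a `Tuple`, which should let each `cholesky` call fully specialize, but then I’ve given up the `AbstractMatrix` interface entirely:

```julia
# sketch: a Tuple preserves each block's concrete SMatrix type, so map
# can specialize per block; but a Tuple is not an AbstractMatrix
mt = Tuple(ms)
chols = map(cholesky, mt)
```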
I believe this is a space where I need to be using `StaticArrays` to see performance improvements, but I’m not sure how to do so with existing tools. At the very least, I don’t want the wheels to come off if `StaticArrays` are passed in. For many other reasons, I’d really like to leverage the `BlockArrays` framework, but the performance I’m seeing isn’t great in this application. I’m looking for tips on how to make this work without reinventing another block diagonal type.