# Efficient sparse-dense matrix multiplication for diagonal-like sparse matrices

I want to implement an efficient and in-place sparse-dense matrix multiplication for sparse matrices with only a few non-zero offset diagonals (known as DIA sparse matrices). Here however, the number of non-zero diagonals and their offsets are known in advance, and do not change throughout the computation. An example of such a matrix is

``````A = [[0, 0, 3, 0],
[1, 0, 0, 2],
[0, 2, 0, 0],
[0, 0, 1, 0]]
# A is stored as
offsets = [-1, 2]
data = [[1,2,1,0],[0,0,3,2]]
``````

I have written a `structure` to store such matrices, which looks like

``````struct SparseMatrixDIA{T,Nd}    # Nd is the number of non-zero diagonals
offsets::NTuple{Nd,Int}     # Offset of each diagonal (0 is the main diagonal)
data::NTuple{Nd,Vector{T}}  # Values in each diagonal
end
``````

I have also written left-side and right-side sparse-dense matrix multiplication functions, following the structure of `LinearAlgebra.mul!`:

``````"""Compute Y = aAX + bY"""
function LinearAlgebra.mul!(Y::Matrix{T}, A::SparseMatrixDIA{T,Nd}, X::Matrix{T}, a::T, b::T) where {T,Nd}
@inbounds for j in 1:N, i in 1:N
Y[i, j] = b * Y[i, j]
end
@inbounds for k in 1:Nd
nk = A.offsets[k]
Dk = @view A.data[k][:]
i_range = max(1, 1 - nk):min(N, N - nk)
for j in 1:N, i in i_range
Y[i, j] += a * Dk[i+nk] * X[i+nk, j]
end
end
return Y
end

"""Compute Y = aXA + bY"""
function LinearAlgebra.mul!(Y::Matrix{T}, X::Matrix{T}, A::SparseMatrixDIA{T,Nd}, a::T, b::T) where {T,Nd}
@inbounds for j in 1:N, i in 1:N
Y[i, j] = b * Y[i, j]
end
@inbounds for k in 1:Nd
nk = A.offsets[k]
Dk = @view A.data[k][:]
j_range = max(1, 1 + nk):min(N, N + nk)
for j in j_range, i in 1:N
Y[i, j] += a * Dk[j-nk] * X[i, j-nk]
end
end
return Y
end
``````

However, when benchmarking this performs much worse than the regular `SparseMatrixCSC` format from `SparseArrays.jl`. Here is the benchmark:

``````using LinearAlgebra
using SparseArrays
using BenchmarkTools

# Define matrices
const N = 200
offsets = [-7, -5, 1, 3, 13]
diags_DIA = [rand(N) for k in eachindex(offsets)]
diags_CSC = [diags_DIA[k][max(1, 1 + offsets[k]):min(N, N + offsets[k])] for k in eachindex(offsets)]

A_DIA = SparseMatrixDIA(tuple(offsets...), tuple(diags_DIA...))
A_CSC = spdiagm([offsets[k] => diags_CSC[k] for k in eachindex(offsets)]...)
X = rand(N, N)
Y = zeros(N, N)
a, b = 1.0, 0.0

# Check mul! works
mul!(Y, A_CSC, X, a, b) == mul!(Y, A_DIA, X, a, b)
mul!(Y, X, A_CSC, a, b) == mul!(Y, X, A_DIA, a, b)

# Benchmark
println("Timing SparseMatrixCSC mul!.")
println("(left-side mul)")
@btime mul!(Y, A_CSC, X, a, b)
println("(right-side mul)")
@btime mul!(Y, X, A_CSC, a, b)
println("\nTiming SparseMatrixDIA mul!.")
println("(left-side mul)")
@btime mul!(Y, A_DIA, X, a, b)
println("(right-side mul)")
@btime mul!(Y, X, A_DIA, a, b)
``````

which returns

``````Timing SparseMatrixCSC mul!.
(left-side mul)
368.931 μs (0 allocations: 0 bytes)
(right-side mul)
33.897 μs (0 allocations: 0 bytes)

Timing SparseMatrixDIA mul!.
(left-side mul)
66.708 μs (0 allocations: 0 bytes)
(right-side mul)
62.886 μs (0 allocations: 0 bytes)
``````

Why is my code so much slower than `SparseMatrixCSC` (in the right-side multiplication at least) ? I would expect to be faster since my functions take advantage of the specific matrix structure. And how can I improve this?

2 Likes

Have you tried with BandedMatrices.jl and its blocked cousin?

Thanks for your answer ! Unfortunately, using BandedMatrices.jl seems to yield slower multiplications. With the same benchmark code as before, and

``````A_BAN = BandedMatrix([offsets[k] => diags_CSC[k] for k in eachindex(offsets)]...)
``````

I get the following timings

``````Timing SparseMatrixCSC mul!.
(left-side mul)
368.255 μs (0 allocations: 0 bytes)
(right-side mul)
33.027 μs (0 allocations: 0 bytes)

Timing SparseMatrixDIA mul!.
(left-side mul)
69.428 μs (0 allocations: 0 bytes)
(right-side mul)
62.909 μs (0 allocations: 0 bytes)

Timing BandedMatrices mul!.
(left-side mul)
1.631 ms (0 allocations: 0 bytes)
(right-side mul)
1.003 ms (0 allocations: 0 bytes)
``````

So about a factor x15 slower than my `SparseMatrixDIA` implementation, and x30 compared to the right-side multiplication of `SparseMatrixCSC`.

In the right-side mul it isn’t so much slower, just 2X. In the right-side case, the CSC format turns out to be sort of optimal because it pre-calculates all the indices/conditions your code does when it constructs the sparse CSC. Specifically, the CSC format has the column indices in the `rowval` field. The `colptr` field has the information regarding how many columns are needed to combine for each output column (so it doesn’t need to find where any diagonal ends)… It is simply ideal for this operation.

So, combine the precalculation with optimized column operations, and you get a 2X improvement. While playing with some code managed to reduce runtime to similar to CSC time, but it essentially replicates the creation of the CSC data.

Thanks for your answer ! So, I cannot expect so much speed up on the right-side multiplication, but do you think I could have a custom type that also features the same kind of performance as `SparseMatrixCSC` on both left and right-side multiplication?

Your function is already faster on left-side than CSC (and quite close on right-side - with more effort can be made equal on right-side).
There is always an inherent asymmetry because matrices choose either row-major or column-major memory layout.

Having said this, from experience, sometimes optimization can surprise. So if this is a crucial bit of your application, keep hacking at it, and the community might also help (and surprise us with better solutions).

P.S. some of the best wins come from using special knowledge of the application outside of the specific region considered - something in the overall algorithm.

For example, while thinking on this function, I thought perhaps a `@generated` function specializing on `Nd` could benefit - by unrolling the loop. Again, I’m not sure how important a 10% speedup is.

1 Like

Another thing, the new `mul!` functions defined in the original post use global `N` const (instead of locally calculating `N`).
And perhaps this could be generalized to non-square matmul (like the sparse version allows).

1 Like

Thanks again for your detailed answer. So, I’m reading up on `@generated` functions. One approach I tried was to give the struct two additional fields that are precomputed, and that only depend on the `offsets` field (which again is fixed throughout computation).

``````struct SparseMatrixDIA{T,Nd}     # Nd is the number of non-zero diagonals
offsets::NTuple{Nd,Int}      # Offset of each diagonal (0 is the main diagonal). offsets is sorted.
data::NTuple{Nd,Vector{T}}   # Values in each diagonal
i_ranges::Vector{UnitRange{Int}} # Ranges of indices with the same offset diagonals involved
idx_pos::Int                 # Index of the first positive offset (assuming offsets is sorted).
end
``````

For my benchmark example, with `offsets = [-7, -5, 1, 3, 13]`, these two fields would read

``````i_ranges = [1:5, 6:7, 8:N-13, N-12:N-3, N-2:N-1, N:N]
idx_pos = 
``````

Such that in the range `1:5` of indices, only the diagonals with offsets `[1, 3, 13]` are involved in the computation, in the range `6:7`, only the diagonals with offsets `[-5, 1, 3, 13]`, etc. Using these two new fields, I can write a `mul!` function as

``````"""Compute Y = aAX + bY"""
function LinearAlgebra.mul!(Y::Matrix{T}, A::SparseMatrixDIA{T,Nd}, X::Matrix{T}, a::T, b::T) where {T,Nd}
@inbounds for j in 1:N
for l in eachindex(A.i_ranges)
for i in A.i_ranges[l]
Y[i, j] = b * Y[i, j]
for k in max(1 + A.idx_pos - l, 1):min(Nd + A.idx_pos - l, Nd)
nk = A.offsets[k]
Y[i, j] += a * A.data[k][i+nk] * X[i+nk, j]
end
end
end
end
end
``````

which now has the great benefit of running over each pair of indices `(i,j)` only once. However, when benchmarking this is highly inefficient because (I’m guessing), the loop over `for i in A.i_ranges[l]` is not fixed at compile time, but depends on the input matrix. Could I solve this with the `@generated` macro? Does this mean I would need to pass the `i_ranges` field as a type parameter? Something like `struct SparseMatrixDIA{T,Nd,i_ranges}`?

I’m not super familiar with generated functions so any help on how to proceed is appreciated. Thanks !

A nice trick I tried is to make a constructor which takes the simple diagonal initialization like in the earlier version, and constructs the pre-calculated schedule for `mul!`.

Let’s see if I can find what I had:

``````struct OffdiagMatrix6{T,Nd}
offsets::NTuple{Nd,Int}     # Offset of each diagonal (0 is the main diagonal)
data::NTuple{Nd,Vector{T}}  # Values in each diagonal
sched::Vector{Tuple{Int, Int, T}} # Prepared column
OffdiagMatrix6(offsets::NTuple{Nd,Int}, diags::NTuple{Nd,Vector{T}}) where {T, Nd} = begin
N = abs(offsets)+length(diags)
sched = sort([ (j, offsets[i]+j, diags[i][j+min(0,offsets[i])]) for i in 1:Nd for j in 1:N if 1 <= offsets[i]+j <= N ])
new{T, Nd}(offsets, diags, sched)
end
end

A_DIA6 = OffdiagMatrix6(tuple(offsets...), tuple(diags_DIA...))

function mul2!(Y::Matrix{T}, X::Matrix{T}, A::OffdiagMatrix6{T,Nd}, a::T, b::T) where {T,Nd}
N = size(X,1)
@inbounds for j in 1:N
for i in 1:N
Y[i, j] *= b
end
end
@inbounds for (k,c,f) in A.sched
for i in 1:N
Y[i, k] += X[i,c]*f*a
end
end
end
``````

This code achieved the same speed on right-side as CSC for some combination of `@inbounds` `@simd` `@turbo` on the `for` loops.

As for `@generated` functions, I used google, and got some good references on top slots.

1 Like