Efficient sparse-dense matrix multiplication for diagonal-like sparse matrices

I want to implement an efficient and in-place sparse-dense matrix multiplication for sparse matrices with only a few non-zero offset diagonals (known as DIA sparse matrices). Here however, the number of non-zero diagonals and their offsets are known in advance, and do not change throughout the computation. An example of such a matrix is

A = [[0, 0, 3, 0],
     [1, 0, 0, 2],
     [0, 2, 0, 0],
     [0, 0, 1, 0]]
# A is stored as
offsets = [-1, 2] 
data = [[1,2,1,0],[0,0,3,2]]

I have written a structure to store such matrices, which looks like

struct SparseMatrixDIA{T,Nd}    # Nd is the number of non-zero diagonals
    offsets::NTuple{Nd,Int}     # Offset of each diagonal (0 is the main diagonal)
    data::NTuple{Nd,Vector{T}}  # Values in each diagonal
end

I have also written left-side and right-side sparse-dense matrix multiplication functions, following the structure of LinearAlgebra.mul!:

"""Compute Y = aAX + bY"""
function LinearAlgebra.mul!(Y::Matrix{T}, A::SparseMatrixDIA{T,Nd}, X::Matrix{T}, a::T, b::T) where {T,Nd}
    @inbounds for j in 1:N, i in 1:N
        Y[i, j] = b * Y[i, j]
    end
    @inbounds for k in 1:Nd
        nk = A.offsets[k]
        Dk = @view A.data[k][:]
        i_range = max(1, 1 - nk):min(N, N - nk)
        for j in 1:N, i in i_range
            Y[i, j] += a * Dk[i+nk] * X[i+nk, j]
        end
    end
    return Y
end

"""Compute Y = aXA + bY"""
function LinearAlgebra.mul!(Y::Matrix{T}, X::Matrix{T}, A::SparseMatrixDIA{T,Nd}, a::T, b::T) where {T,Nd}
    @inbounds for j in 1:N, i in 1:N
        Y[i, j] = b * Y[i, j]
    end
    @inbounds for k in 1:Nd
        nk = A.offsets[k]
        Dk = @view A.data[k][:]
        j_range = max(1, 1 + nk):min(N, N + nk)
        for j in j_range, i in 1:N
            Y[i, j] += a * Dk[j-nk] * X[i, j-nk]
        end
    end
    return Y
end

However, when benchmarking this performs much worse than the regular SparseMatrixCSC format from SparseArrays.jl. Here is the benchmark:

using LinearAlgebra
using SparseArrays
using BenchmarkTools

# Define matrices
const N = 200
offsets = [-7, -5, 1, 3, 13]
diags_DIA = [rand(N) for k in eachindex(offsets)]
diags_CSC = [diags_DIA[k][max(1, 1 + offsets[k]):min(N, N + offsets[k])] for k in eachindex(offsets)]

A_DIA = SparseMatrixDIA(tuple(offsets...), tuple(diags_DIA...))
A_CSC = spdiagm([offsets[k] => diags_CSC[k] for k in eachindex(offsets)]...)
X = rand(N, N)
Y = zeros(N, N)
a, b = 1.0, 0.0

# Check mul! works
mul!(Y, A_CSC, X, a, b) == mul!(Y, A_DIA, X, a, b)
mul!(Y, X, A_CSC, a, b) == mul!(Y, X, A_DIA, a, b)

# Benchmark
println("Timing SparseMatrixCSC mul!.")
println("(left-side mul)")
@btime mul!(Y, A_CSC, X, a, b)
println("(right-side mul)")
@btime mul!(Y, X, A_CSC, a, b)
println("\nTiming SparseMatrixDIA mul!.")
println("(left-side mul)")
@btime mul!(Y, A_DIA, X, a, b)
println("(right-side mul)")
@btime mul!(Y, X, A_DIA, a, b)

which returns

Timing SparseMatrixCSC mul!.
(left-side mul)
  368.931 μs (0 allocations: 0 bytes)
(right-side mul)
  33.897 μs (0 allocations: 0 bytes)

Timing SparseMatrixDIA mul!.
(left-side mul)
  66.708 μs (0 allocations: 0 bytes)
(right-side mul)
  62.886 μs (0 allocations: 0 bytes)

Why is my code so much slower than SparseMatrixCSC (in the right-side multiplication at least) ? I would expect to be faster since my functions take advantage of the specific matrix structure. And how can I improve this?

2 Likes

Have you tried with BandedMatrices.jl and its blocked cousin?

Thanks for your answer ! Unfortunately, using BandedMatrices.jl seems to yield slower multiplications. With the same benchmark code as before, and

A_BAN = BandedMatrix([offsets[k] => diags_CSC[k] for k in eachindex(offsets)]...)

I get the following timings

Timing SparseMatrixCSC mul!.
(left-side mul)
  368.255 μs (0 allocations: 0 bytes)
(right-side mul)
  33.027 μs (0 allocations: 0 bytes)

Timing SparseMatrixDIA mul!.
(left-side mul)
  69.428 μs (0 allocations: 0 bytes)
(right-side mul)
  62.909 μs (0 allocations: 0 bytes)

Timing BandedMatrices mul!.
(left-side mul)
  1.631 ms (0 allocations: 0 bytes)
(right-side mul)
  1.003 ms (0 allocations: 0 bytes)

So about a factor x15 slower than my SparseMatrixDIA implementation, and x30 compared to the right-side multiplication of SparseMatrixCSC.

In the right-side mul it isn’t so much slower, just 2X. In the right-side case, the CSC format turns out to be sort of optimal because it pre-calculates all the indices/conditions your code does when it constructs the sparse CSC. Specifically, the CSC format has the column indices in the rowval field. The colptr field has the information regarding how many columns are needed to combine for each output column (so it doesn’t need to find where any diagonal ends)… It is simply ideal for this operation.

So, combine the precalculation with optimized column operations, and you get a 2X improvement. While playing with some code managed to reduce runtime to similar to CSC time, but it essentially replicates the creation of the CSC data.

Thanks for your answer ! So, I cannot expect so much speed up on the right-side multiplication, but do you think I could have a custom type that also features the same kind of performance as SparseMatrixCSC on both left and right-side multiplication?

Your function is already faster on left-side than CSC (and quite close on right-side - with more effort can be made equal on right-side).
There is always an inherent asymmetry because matrices choose either row-major or column-major memory layout.

Having said this, from experience, sometimes optimization can surprise. So if this is a crucial bit of your application, keep hacking at it, and the community might also help (and surprise us with better solutions).

P.S. some of the best wins come from using special knowledge of the application outside of the specific region considered - something in the overall algorithm.

For example, while thinking on this function, I thought perhaps a @generated function specializing on Nd could benefit - by unrolling the loop. Again, I’m not sure how important a 10% speedup is.

1 Like

Another thing, the new mul! functions defined in the original post use global N const (instead of locally calculating N).
And perhaps this could be generalized to non-square matmul (like the sparse version allows).

1 Like

Thanks again for your detailed answer. So, I’m reading up on @generated functions. One approach I tried was to give the struct two additional fields that are precomputed, and that only depend on the offsets field (which again is fixed throughout computation).

struct SparseMatrixDIA{T,Nd}     # Nd is the number of non-zero diagonals
    offsets::NTuple{Nd,Int}      # Offset of each diagonal (0 is the main diagonal). offsets is sorted.
    data::NTuple{Nd,Vector{T}}   # Values in each diagonal
    i_ranges::Vector{UnitRange{Int}} # Ranges of indices with the same offset diagonals involved
    idx_pos::Int                 # Index of the first positive offset (assuming offsets is sorted).
end

For my benchmark example, with offsets = [-7, -5, 1, 3, 13], these two fields would read

i_ranges = [1:5, 6:7, 8:N-13, N-12:N-3, N-2:N-1, N:N]
idx_pos = [3]

Such that in the range 1:5 of indices, only the diagonals with offsets [1, 3, 13] are involved in the computation, in the range 6:7, only the diagonals with offsets [-5, 1, 3, 13], etc. Using these two new fields, I can write a mul! function as

"""Compute Y = aAX + bY"""
function LinearAlgebra.mul!(Y::Matrix{T}, A::SparseMatrixDIA{T,Nd}, X::Matrix{T}, a::T, b::T) where {T,Nd}
    @inbounds for j in 1:N
        for l in eachindex(A.i_ranges)
            for i in A.i_ranges[l]
                Y[i, j] = b * Y[i, j]
                for k in max(1 + A.idx_pos - l, 1):min(Nd + A.idx_pos - l, Nd)
                    nk = A.offsets[k]
                    Y[i, j] += a * A.data[k][i+nk] * X[i+nk, j]
                end
            end
        end
    end
end

which now has the great benefit of running over each pair of indices (i,j) only once. However, when benchmarking this is highly inefficient because (I’m guessing), the loop over for i in A.i_ranges[l] is not fixed at compile time, but depends on the input matrix. Could I solve this with the @generated macro? Does this mean I would need to pass the i_ranges field as a type parameter? Something like struct SparseMatrixDIA{T,Nd,i_ranges}?

I’m not super familiar with generated functions so any help on how to proceed is appreciated. Thanks !

A nice trick I tried is to make a constructor which takes the simple diagonal initialization like in the earlier version, and constructs the pre-calculated schedule for mul!.

Let’s see if I can find what I had:

struct OffdiagMatrix6{T,Nd}
    offsets::NTuple{Nd,Int}     # Offset of each diagonal (0 is the main diagonal)
    data::NTuple{Nd,Vector{T}}  # Values in each diagonal
    sched::Vector{Tuple{Int, Int, T}} # Prepared column
    OffdiagMatrix6(offsets::NTuple{Nd,Int}, diags::NTuple{Nd,Vector{T}}) where {T, Nd} = begin
        N = abs(offsets[1])+length(diags[1])
        sched = sort([ (j, offsets[i]+j, diags[i][j+min(0,offsets[i])]) for i in 1:Nd for j in 1:N if 1 <= offsets[i]+j <= N ])
        new{T, Nd}(offsets, diags, sched)
    end           
end

A_DIA6 = OffdiagMatrix6(tuple(offsets...), tuple(diags_DIA...))

function mul2!(Y::Matrix{T}, X::Matrix{T}, A::OffdiagMatrix6{T,Nd}, a::T, b::T) where {T,Nd}
    N = size(X,1)
    @inbounds for j in 1:N
        for i in 1:N
            Y[i, j] *= b
        end
    end
    @inbounds for (k,c,f) in A.sched
        for i in 1:N
            Y[i, k] += X[i,c]*f*a
        end
    end
end

This code achieved the same speed on right-side as CSC for some combination of @inbounds @simd @turbo on the for loops.

As for @generated functions, I used google, and got some good references on top slots.

1 Like