I want to implement an efficient and in-place sparse-dense matrix multiplication for sparse matrices with only a few non-zero offset diagonals (known as DIA sparse matrices). Here however, **the number of non-zero diagonals and their offsets are known in advance**, and do not change throughout the computation. An example of such a matrix is

```
A = [[0, 0, 3, 0],
[1, 0, 0, 2],
[0, 2, 0, 0],
[0, 0, 1, 0]]
# A is stored as
offsets = [-1, 2]
data = [[1,2,1,0],[0,0,3,2]]
```

I have written a `structure`

to store such matrices, which looks like

```
struct SparseMatrixDIA{T,Nd} # Nd is the number of non-zero diagonals
offsets::NTuple{Nd,Int} # Offset of each diagonal (0 is the main diagonal)
data::NTuple{Nd,Vector{T}} # Values in each diagonal
end
```

I have also written left-side and right-side sparse-dense matrix multiplication functions, following the structure of `LinearAlgebra.mul!`

:

```
"""Compute Y = aAX + bY"""
function LinearAlgebra.mul!(Y::Matrix{T}, A::SparseMatrixDIA{T,Nd}, X::Matrix{T}, a::T, b::T) where {T,Nd}
@inbounds for j in 1:N, i in 1:N
Y[i, j] = b * Y[i, j]
end
@inbounds for k in 1:Nd
nk = A.offsets[k]
Dk = @view A.data[k][:]
i_range = max(1, 1 - nk):min(N, N - nk)
for j in 1:N, i in i_range
Y[i, j] += a * Dk[i+nk] * X[i+nk, j]
end
end
return Y
end
"""Compute Y = aXA + bY"""
function LinearAlgebra.mul!(Y::Matrix{T}, X::Matrix{T}, A::SparseMatrixDIA{T,Nd}, a::T, b::T) where {T,Nd}
@inbounds for j in 1:N, i in 1:N
Y[i, j] = b * Y[i, j]
end
@inbounds for k in 1:Nd
nk = A.offsets[k]
Dk = @view A.data[k][:]
j_range = max(1, 1 + nk):min(N, N + nk)
for j in j_range, i in 1:N
Y[i, j] += a * Dk[j-nk] * X[i, j-nk]
end
end
return Y
end
```

However, when benchmarking this performs much worse than the regular `SparseMatrixCSC`

format from `SparseArrays.jl`

. Here is the benchmark:

```
using LinearAlgebra
using SparseArrays
using BenchmarkTools
# Define matrices
const N = 200
offsets = [-7, -5, 1, 3, 13]
diags_DIA = [rand(N) for k in eachindex(offsets)]
diags_CSC = [diags_DIA[k][max(1, 1 + offsets[k]):min(N, N + offsets[k])] for k in eachindex(offsets)]
A_DIA = SparseMatrixDIA(tuple(offsets...), tuple(diags_DIA...))
A_CSC = spdiagm([offsets[k] => diags_CSC[k] for k in eachindex(offsets)]...)
X = rand(N, N)
Y = zeros(N, N)
a, b = 1.0, 0.0
# Check mul! works
mul!(Y, A_CSC, X, a, b) == mul!(Y, A_DIA, X, a, b)
mul!(Y, X, A_CSC, a, b) == mul!(Y, X, A_DIA, a, b)
# Benchmark
println("Timing SparseMatrixCSC mul!.")
println("(left-side mul)")
@btime mul!(Y, A_CSC, X, a, b)
println("(right-side mul)")
@btime mul!(Y, X, A_CSC, a, b)
println("\nTiming SparseMatrixDIA mul!.")
println("(left-side mul)")
@btime mul!(Y, A_DIA, X, a, b)
println("(right-side mul)")
@btime mul!(Y, X, A_DIA, a, b)
```

which returns

```
Timing SparseMatrixCSC mul!.
(left-side mul)
368.931 μs (0 allocations: 0 bytes)
(right-side mul)
33.897 μs (0 allocations: 0 bytes)
Timing SparseMatrixDIA mul!.
(left-side mul)
66.708 μs (0 allocations: 0 bytes)
(right-side mul)
62.886 μs (0 allocations: 0 bytes)
```

Why is my code so much slower than `SparseMatrixCSC`

(in the right-side multiplication at least) ? I would expect to be faster since my functions take advantage of the specific matrix structure. And how can I improve this?