Hi,

This is a follow-up to a previous thread, where @Sukera and I discussed improving the matrix-vector product in the case where the matrix is sparse but also has only one possible non-zero value per row.

Mathematically, this predicate on the matrix structure should allow a factorization in the multiplication algorithm and hence improve performance. I managed to code a simple MWE:
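To spell out the factorization: if row `i` of `P` has the single non-zero value `b[i]`, then `P == Diagonal(b) * M` where `M` is the Boolean sparsity pattern, so `P*x == b .* (M*x)`. A small self-contained check (the names here are mine, just for illustration):

```
using SparseArrays, LinearAlgebra

# Build a tiny matrix with one possible non-zero value per row:
b = [2.0, 0.0, 5.0]
M = sparse(Bool[1 0 1; 0 0 0; 0 1 0])
P = Diagonal(b) * M          # P satisfies the predicate by construction

x = [1.0, 2.0, 3.0]
@assert P * x ≈ b .* (M * x) # the factorized product agrees
```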

```
using SparseArrays

n, m = 1000, 1000
x = rand(m)
# one possible non-zero value per row:
P = sparse((rand(n, m) .> 0.97) .* rand(n));

# check the predicate and split P into its row values b and its Boolean pattern M:
function decompose(P)
    @assert all(sum(unique(P[i, :]) .!= 0) <= 1 for i in 1:size(P, 1))
    b = [sum(unique(P[i, :])) for i in 1:size(P, 1)]
    M = sparse(P .!= 0)   # already sparse; no need to go through a dense BitMatrix
    return b, M
end
b, M = decompose(P)

@inline my_prod(b, M, x) = b .* (M * x)

function my_prod(r, b, M, x)
    @inbounds for i in 1:size(M, 1)   # was `for i in size(M,1)`, which only visits the last row
        r[i] = b[i] * sum(x[M[i, :]])
    end
    return r
end

r = zeros(n)   # start from zeros, otherwise the check below is vacuous
# check:
my_prod(r, b, M, x) ≈ P * x
# test:
using BenchmarkTools
@btime P * $(x);
@btime my_prod(r, b, M, $x);
```

Which gives:

```
julia> @btime P*$(x);
59.700 μs (1 allocation: 7.94 KiB)
julia> @btime my_prod(r,b,M,$x);
21.500 μs (11 allocations: 1.16 KiB)
julia>
```

So roughly half the time. I was hoping for more, and I am starting to wonder whether my testing is fair…
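One thing that can skew the comparison: in the snippet above only `x` is interpolated into `@btime`, while `P`, `r`, `b`, `M` are untyped globals, which adds dispatch overhead to the measurement. A minimal self-contained illustration (matrix names are mine):

```
using BenchmarkTools, SparseArrays

A = sprand(1000, 1000, 0.03)
v = rand(1000)
# Non-interpolated globals add global-lookup/dispatch overhead to the timing;
# interpolating with `$` benchmarks just the product itself:
@btime A * v;      # A and v are untyped globals here
@btime $A * $v;    # fair measurement
```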

Did I do something wrong? I also notice that my code allocates quite a lot; could that be fixed?
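For what it's worth, one way I would try to remove the allocations (a sketch; `my_prod_cols!` is a name I made up): `SparseMatrixCSC` is stored column-major, so iterating the columns with `rowvals`/`nzrange` avoids materializing a row slice `x[M[i,:]]` on every iteration:

```
using SparseArrays

# Accumulate M*x column by column (CSC order), then scale by b in place.
# Assumes M is the Boolean pattern and b the per-row value from `decompose`;
# every stored entry of M is `true`, so its values never need to be read.
function my_prod_cols!(r, b, M, x)
    fill!(r, zero(eltype(r)))
    rows = rowvals(M)
    @inbounds for j in 1:size(M, 2)
        xj = x[j]
        for k in nzrange(M, j)
            r[rows[k]] += xj
        end
    end
    r .*= b
    return r
end
```

Each product is then a single pass over the non-zeros of `M` plus one broadcast, with no allocations beyond the preallocated `r`.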

In my application, the matrix `P` is held fixed and the vector `x` changes from one iteration to the next, hence my concern about whether this particular structure could be exploited a little more.
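Since `P` is fixed, another option (again just a sketch, with made-up names) is to flatten its pattern once into plain index/offset arrays, CSR-style, so that every subsequent product is one contiguous pass over `x` with no sparse indexing at all:

```
using SparseArrays

# One-time preprocessing: for each row, collect the column indices of its
# non-zeros into a flat vector `cols`, with `offsets` delimiting the rows.
function plan_rows(M::SparseMatrixCSC)
    n = size(M, 1)
    rows = rowvals(M)
    counts = zeros(Int, n)
    for ri in rows
        counts[ri] += 1
    end
    offsets = cumsum([1; counts])   # row i occupies cols[offsets[i]:offsets[i+1]-1]
    pos = offsets[1:n]
    cols = Vector{Int}(undef, length(rows))
    for j in 1:size(M, 2), k in nzrange(M, j)
        cols[pos[rows[k]]] = j
        pos[rows[k]] += 1
    end
    return offsets, cols
end

# Per-iteration product: r[i] = b[i] * sum of x over row i's columns.
function my_prod_plan!(r, b, offsets, cols, x)
    @inbounds for i in eachindex(r)
        s = 0.0
        for k in offsets[i]:offsets[i+1]-1
            s += x[cols[k]]
        end
        r[i] = b[i] * s
    end
    return r
end
```

The preprocessing cost is paid once; each call to `my_prod_plan!` is allocation-free and touches only `b`, `offsets`, `cols`, and `x`.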