Faster matrix product with a specifically structured matrix


I have a sparse matrix P that turns out to have only equal values (or zeros) on each line:

julia> all([sum(unique(P[i,:]).!=0)<=1 for i in 1:size(P,1)])


Where I used <=1 since some lines are completely zeros.

With this matrix, I’m repetitively doing matrix-vector products P*x on different vectors. But since every line has only one value, I guess that the matrix-vector product could be done more efficiently than what SparseArrays does.

Any Idea if there is a ready-to-use matrix class like this one, or if I could use a simple one-hot-encoding structure to solve this multiplication issue more efficiently ?

Did you notice any particular slowdowns? Do you have an example for your matrix-vector product?

(Though I don’t think this is necessarily needed here.)

I do have an example but its a little hidden behind other things.

You could construct easily one of these matrices :

P = (rand(1000,1000).>0.97) .* rand(1000)

Where the 0.97 and size 1000 is what I have empirically on a true example.

That’s a dense matrix though. All zero elements are stored, you can convert that to a true sparse matrix like this:

julia> using SparseArrays                                             
julia> P = sparse((rand(1000,1000).>0.97) .* rand(1000));             
julia> summary(P)                                                     
"1000×1000 SparseMatrixCSC{Float64, Int64} with 29741 stored entries" 

Multiplication of which should be pretty fast.

That construction doesn’t necessarily obey the one-hot criterion by the way. It’s just unlikely to not be one-hot. Do you have other guarantees that make sure this still holds?

Yes this is what i’m doing for the moment. But since each row has only one possible non-zero value, I was hoping for a further improvement.

I’m pretty sure that the way I’m constructing the matrix garanties that my criterion (in my first post) will always hold.

Do you mean in your real example or in the way you’ve shown here? What you’ve done here is basically just sprand, except for a dense matrix (and allocating a whole bunch of intermediaries):

julia> P = sprand(1000,1000,0.03)                                  
1000×1000 SparseMatrixCSC{Float64, Int64} with 29667 stored entries

and for that, the property absolutely does not hold: scratch that, wrong comparison

1 Like

Well, In my real examples it holds, but also :

julia> P = sparse((rand(1000,1000).>0.97) .* rand(1000));

julia> all([sum(unique(P[i,:]).!=0)<=1 for i in 1:size(P,1)])


However, as you noted,

julia> P = sprand(1000,1000,0.03);

julia> all([sum(unique(P[i,:]).!=0)<=1 for i in 1:size(P,1)])


This is not the same thing.

Furthermore, the rate of non-zero values is also around 3% on each line.

Which means that SparseArrays is doing, for every line i of the matrix-vector product,

\sum_{j \in \text{Non-zero}} P[i,j]*x[j]

while we could do :

P[i,j]*\sum_{j \in \text{Non-zero}} x[j]

Which reduces the number of multiplications from ~ (3% of the number of columns) to 1 per line, inducing a 30x theoretical speedup for 1000*1000 matrices (neglecting the addition cost). I was asking about a neat way of implementing this.

You’re right - my mistake. It should have been:

julia> P = sprand(Bool, 1000, 1000, 0.03) .* rand(1000)

This keeps sparsity and your property that each column only has one unique value though it may occur more than once:

julia> findnz(P[1,:])[2] |> length
julia> findnz(P[1,:])[2][1]       
julia> findnz(P[1,:])[2][1:4]     
4-element Vector{Float64}:        

How would you move the iterating index outside of the loop…?

I think this is a red herring. You still have to store the indices of each column where there is a nonzero, even if you only store “one” nonzero per column. A general matrix product still has to look at all of those.

As you noted every values of P[i,j] is the same for a fixed line.

Here is a dummy implementation:

n,m = 1000,1000
x = rand(m)
P = sparse((rand(n,m).>0.97) .* rand(m));

# check predicate:
all([sum(unique(P[i,:]).!=0)<=1 for i in 1:size(P,1)])

# extract the b vector: 
b = [sum(unique(P[i,:])) for i in 1:size(P,1)]

# compute the matrix-vector product: 
rez = similar(b)
for i in 1:n
    rez[i] = b[i]*sum(x[P[i,:].nzind])

rez ≈ P*x

It’s not though:

julia> findnz(P[1,:])[1] == findnz(P[2,:])[1] # column 1 has different nonzeros to column 2
julia> findnz(P[:,1])[1] == findnz(P[:,2])[1] # row 1 has different nonzeros to row 2

I think you’re confusing columns and rows. Julia is column major, i.e. the first index iterates over columns.

In any case, the sparsity per row and per column is different. There’s no shortcut here (well, or I’m missing something major here, but I don’t think so).

Hum…; This is not concordant with what my machine says:

julia> P = [1 2
       3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> for i in 1:2
2-element Vector{Int64}:
2-element Vector{Int64}:

Right, mixup on my side - my apologies! :sweat_smile:

Still, the sparsity pattern per row is not the always the same. You have to check each row seperately. Same goes for columns.

Yes, I have to check the sparsity pattern of each row separately, which is what i’ve done in my dummy implmementation. But still, the number of multiplications is way reduced, and hence I can expect better performance. For the moment, my implementation is allocating like crazy so I do not have a better perf than SparseArrays, but this should be possible :slight_smile:

You’re allocating like crazy because x[P[i,:].nzind] creates a copy (as do a bunch of other operations you’re doing, like indexing with a bitmatrix…).

This code seems very confusing, am I reading this right that you’re using the sparsity pattern of one matrix to index into another…? It reads very Matlab-y, where forcing everything into matrix/vector operations is the way to go, but this pattern will not lead to good performance in julia, I’m afraid.

May I ask, what problem/code are you trying to accelerate exactly? I don’t mean the matmul, I mean what you’re doing with the matmul. Loops in julia are fast, no need to shy away from them.

I’m sorry I do not know about Matlab.

With the matmul i’m doing a Dykstra projection in an optimization routine. This matmul is evaluated billions of time, and represent 50% of my runtime (hence my interest for it ^^ ).

Yes, for the moment it is allocating, but dont you think a non-allocative version could be produced, and will provide a better perf than SparseArrays ? This is why I was asking about hot-encoding at first (using a sparsity pattern of a matrix is like using a bitmatrix).

Ho maybe the bitmatrix is what I should use ?

I don’t think so. This should have the same number of multiplications as a regular sparse matrix multiplication, since you’re taking the indices per row from the sparse matrix anyway.

No, I don’t think so, which is why I’m confused about you trying to up the ante on known industry practices.

Are you certain that that runtime is due to slow matmul of real sparse vectors and not due to e.g. a lot of allocations from indexing with vectors etc? Can you share the code that’s running in a hot loop that’s too slow?

I do not think you understood the predicate I use and the structure of the matrix: For each line of the matrix, every non-zero-value is the same value. Hence the matrix-vector product fatorizes. The number of multiplications is reduced, but, you are right, the indexing is still the same.

Here is a (quickly-extracted so not that clean) MWE:

using SparseArrays
struct SparseLinearConstraint{T}
    P::SparseMatrixCSC{T, Int64}
    q::SparseVector{T, Int64}
    rez::SparseVector{T, Int64}
function SparseLinearConstraint(A::Matrix{T},b::Vector{T}) where T
    AiAA = A'inv(A*A')
    P = sparse(-AiAA*A)
    q = sparse(+AiAA*b)
    return SparseLinearConstraint(A,b,P,q,spzeros(eltype(q),length(q)))
@inline function proj!(x,X::SparseLinearConstraint)
    X.rez .= x + X.P*x + X.q  ##### This lines takes 70% of runtime, due to the sparse matrix-vector product.
struct SparsePositivityConstraint{T} 
    rez::SparseVector{T, Int64}
function SparsePositivityConstraint(x)
    return SparsePositivityConstraint(spzeros(eltype(x),length(x)),sqrt(eps(eltype(x))))
@inline function proj!(x,X::SparsePositivityConstraint)
    X.rez .= x
    SparseArrays.fkeep!(X.rez, (i,x) -> x>X.tol)

function Dykstra!(x,ctrs)
    old_x = zero(x)
    err = sqrt(eps(eltype(x)))
    sn = [zero(x) for _ in ctrs]
    while !all(abs.(old_x .-x) .< err)
        old_x .= x
        for (i,ctr) in enumerate(ctrs)
            sn[i] += ctr.rez - x
            x .= ctr.rez
    return nothing

# A simple test case:
using Random
function linearize(a::Vector{T},b::Vector{T}) where T
    n1 = length(a)
    n2 = length(b)
    p = (n1+1)*(n2+1)
    eq_A = zeros(T,(n1+n2,p-1))
    eq_b = [a...,b...]
    for i in 0:n2
        for j in 1:n1
            eq_A[j,j+i*(1+n1)] = 1
    for i in 1:n2
        eq_A[n1+i,(1:(n1+1)).+n1.+(i-1)*(n1+1)] .= 1
    return eq_A,eq_b
Solve_Type = Float64
n,m = 30,30;
a1 = rand(n);
a2 = rand(m);
x = Solve_Type.(randn((n+1)*(m+1)-1));
spx = deepcopy(x);
spx = sparse(spx);
A,b= linearize(a1,a2);

constraints = [SparseLinearConstraint(A[1:n,:],a1),

# Some tests: 
spx = deepcopy(x);
@time Dykstra!(spx,constraints)
spx = deepcopy(x);
@time Dykstra!(spx,constraints)
spx = deepcopy(x);
spx = sparse(spx)
@profview Dykstra!(spx,constraints)

The main algorithm is in the Dykstra! function, with a majority of the runtime in the proj!(x,X::SparseLinearConstraint) function.
In this function, profview reports SparseArrays.mul! to be the most time-consuming function.

EDIT: The upper code has been edited once or twice. I have now a working version of what I want, which writes as :

using SparseArrays
n,m = 1000,1000
x = rand(m)
P = sparse((rand(n,m).>0.97) .* rand(m));

# check predicate:
all([sum(unique(P[i,:]).!=0)<=1 for i in 1:size(P,1)])

# extract the b vector and idx vector:
b = [sum(unique(P[i,:])) for i in 1:size(P,1)]
M = sparse(BitMatrix(P .!= 0))
idx = [P[i,:].nzind for i in 1:n]
rez = similar(b)
@inline my_prod!(b,M,x) = b.* (M*x)

rez = my_prod!(b,M,x)
rez ≈ P*x

# test : 
using BenchmarkTools
@btime P*$(x);
@btime my_prod!(b,M,$x);

Which now has the same runtime as the full sparse product, but still allocates a little too much. I’m sure we can do better.