Taking advantage of rowwise-constant sparse matrix

Hi,

This is a follow-up of this previous thread where we discussed with @Sukera the possibility to improve matrix-vector product in the case where the matrix is sparse, but also has only one possible non-zero value per line.

Mathematically, this predicate on the matrix structure should allow a factorization in the multiplication algorithm and hence enhance performance. I managed to code a simple MWE :

using SparseArrays
n,m = 1000,1000
x = rand(m)
P = sparse((rand(n,m).>0.97) .* rand(m));

# check predicate and decompose: 
function decompose(P)
    @assert all([sum(unique(P[i,:]).!=0)<=1 for i in 1:size(P,1)])
    b = [sum(unique(P[i,:])) for i in 1:size(P,1)]
    M = sparse(BitMatrix(P .!= 0))
    return b,M
end

b,M = decompose(P)
@inline my_prod(b,M,x) = b.*(M*x)


function my_prod(r,b,M,x)
    @inbounds for i in size(M,1)
        r[i] = b[i]*sum(x[M[i,:]])
    end
    return r
end
r = P*x
# check : 
my_prod(r,b,M,x) ≈ P*x

# test : 
using BenchmarkTools
@btime P*$(x);
@btime my_prod(r,b,M,$x);

Which gives :

julia> @btime P*$(x);
  59.700 μs (1 allocation: 7.94 KiB)

julia> @btime my_prod(r,b,M,$x);
  21.500 μs (11 allocations: 1.16 KiB)

julia> 

So a 1/2 time. I was hoping for more, but I start wandering if my testing is fair…

Did I make something wrong ? I notice that my code is allocating a lot, could this be fixed ?

In my application, the matrix P is held fixed and the vector x changes from one iteration to the other, hence my concerns about whether this particular structure could be exploited a little more.

Apologies, I was cooking my dinner. That’s why I didn’t answer right away in the other thread.

This function doesn’t do what you think it does.

  • size(M,1) is a single integer - the loop is only running for a single iteration because numbers are iterable in julia and thus the result must be wrong. You probably meant to write axes(M, 1) or 1:size(M,1). The former gives an iterable over the first axis, the latter constructs the same manually. Your check still passes because only the last entry is overwritten and r still holds the correct result from P*x.
julia> r_ = zeros(size(r)); 
                            
julia> my_prod(r_,b,M,x) ≈ r
false                       
                            
julia> findall(!=(0), r_)   
1-element Vector{Int64}:    
 1000                       
  • Since M[i,:] is a vector, x[M[i,:]] will slice (and thus allocate). This doesn’t hurt too bad for one iteration, but it will kill performance for all appropriate iterations:
julia> function my_prod(r,b,M,x)         
           for i in 1:size(M,1) # fixed iterations         
               r[i] = b[i]*sum(x[M[i,:]])
           end                           
           return r                      
       end                               

my_prod (generic function with 1 method) 
julia> @btime P*x;                        
  29.900 μs (1 allocation: 7.94 KiB)      
                                          
julia> @btime my_prod(r,b,M,$x);          
  39.710 ms (7038 allocations: 952.67 KiB)

julia> r_ = zeros(size(r)); 
                            
julia> my_prod(r_,b,M,x) ≈ r
true                        

Indeed, a silly mistake… And hence the result is 1000x slower… Thanks for catching it.

The inline version still has the same performance as the direct one, with 2 allocations instead of one.

let’s forget about the unrolled version, wich I cannot make a lot faster than 500x times slower than the direct approach. What about the inline version :

using SparseArrays
n,m = 1000,1000
x = rand(m)
P = sparse((rand(n,m).>0.97) .* rand(m));

# check predicate and decompose: 
function decompose(P)
    @assert all([sum(unique(P[i,:]).!=0)<=1 for i in 1:size(P,1)])
    b = [sum(unique(P[i,:])) for i in 1:size(P,1)]
    M = sparse(BitMatrix(P .!= 0))
    return b,M
end

b,M = decompose(P)
@inline my_prod(b,M,x) = @inbounds b.*(M*x)
my_prod(b,M,x) ≈ P*x

# test : 
using BenchmarkTools
@btime P*$(x);
@btime my_prod(b,M,$x);

Gives:

julia> @btime P*$(x);
  59.500 μs (1 allocation: 7.94 KiB)

julia> @btime my_prod(b,M,$x);
  60.600 μs (2 allocations: 15.88 KiB)

Maybe the SparseMatrixCSC{Bool, Int64} product already takes adventage of the Bool eltype ?

I think you really need to exploit the sparsity of this M, or of the original P. Here’s a fast version of the idea to slice M and multiply by the dense x, and it’s still many times slower than the original:

julia> function my_prod_dot2(r,b,MT,x)
           @inbounds for i in axes(MT,2)
               r[i] = b[i] * dot(x, @view MT[:,i])
           end
           return r
       end
my_prod_dot2 (generic function with 1 method)

julia> MT = Matrix{Float64}(M');

julia> @btime my_prod_dot2($r,$b,$MT,$x);
  406.125 μs (0 allocations: 0 bytes)

julia> @btime $P*$(x);
  18.792 μs (1 allocation: 7.94 KiB)

julia> function my_prod_dense(r,b,MT,x)
         mul!(r, MT', x)
         r .*= b
         r
       end;

julia> @btime my_prod_dense($r,$b,$MT,$x);
  49.333 μs (0 allocations: 0 bytes)

julia> r ≈ P * x
true
1 Like

That’s because you’re doing a sparse multiplication again:

The sparse matmul M*x allocates its result res, and b .* res allocates its result. You’re not going to beat sparse matrix multiplication by using sparse matrix multiplication.

Indeed :wink:

I guess my next move is to go see how this sparse matrix multiplication is implemented.

No, there is no smart trick to apply here. If you don’t know a lot more about the specific structure in your P, you basically still have random entries in your matrix to access. Whether or not the values in a column are set to the same value is not going to matter all that much compared to checking whether an index should be multiplied at all.

The only way I can think of that may be faster than this is by only saving a single element per column. This may lead to always-hot caches, but I doubt it will be enough to meaningfully impact the runtime because I’m fairly certain that the index lookups dominate here.

Thanks for taking the time anyway. Two things:

1° My rows are constants, not my columns.
2° Indeed, the lookup might be the problem more than the floats multiplications. With more involved multiplications, this pays out:

using SparseArrays
n,m = 1000,1000
ElType = BigFloat
x = ElType.(rand(m))
P = sparse(ElType.((rand(n,m).>0.97) .* rand(m)));

# check predicate and decompose: 
function decompose(P)
    @assert all([sum(unique(P[i,:]).!=0)<=1 for i in 1:size(P,1)])
    b = [sum(unique(P[i,:])) for i in 1:size(P,1)]
    M = sparse(BitMatrix(P .!= 0))
    return b,M
end

b,M = decompose(P)
@inline my_prod(b,M,x) = @inbounds b.*(M*x)
my_prod(b,M,x) ≈ P*x

# test : 
using BenchmarkTools
@btime P*$(x);
@btime my_prod(b,M,$x);

gives:

julia> @btime P*$(x);
  9.727 ms (126113 allocations: 6.67 MiB)

julia> @btime my_prod(b,M,$x);
  6.875 ms (248222 allocations: 11.14 MiB)

and for DoubleFloats.Double64 numbers this gives:

julia> @btime P*$(x);
  276.300 μs (1 allocation: 15.75 KiB)

julia> @btime my_prod(b,M,$x);
  154.400 μs (2 allocations: 31.50 KiB)

For Float64, the indexing overhead is too high and the multiplication cost too low for this approach to be worth it. But still, I’m not just using SparseArrays multiplications to replace SparseArrays multiplications :wink: