Faster matrix product with a specifically structured matrix

I think there is a recent package for threaded multiplication of sparse matrices.

By isolating the part of the code that should exploit the particular structure of the matrix, we see that the improvement that can be obtained is not large:

using BenchmarkTools
  v=collect(1.:10^6)
  x=similar(v)
  fill!(x,1.999)

julia> @btime begin
         c=0.
         for j in 1:10^6
             c += $v[j]*$x[j]
         end
       c
       end
  942.800 μs (0 allocations: 0 bytes)
9.995009995e11

julia> @btime begin
         c=0.
         for j in 1:10^6
           c += $v[j]
         end
         c *= $x[1]
       end
  866.600 μs (0 allocations: 0 bytes)
9.995009995e11

Not necessarily:

julia> @btime let c=0.
                @simd for j in 1:10^6
                    @inbounds c += $v[j]*$x[j]
                end
              c
              end
  639.450 μs (0 allocations: 0 bytes)
9.995009995e11

julia> @btime let c=0.
                @simd for j in 1:10^6
                    @inbounds c += $v[j]
                end
              c *= $x[1]
              end
  211.213 μs (0 allocations: 0 bytes)
9.995009995e11
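For comparison, the same factorization (summing v once and multiplying by the constant) can also be written with the built-in reductions; this is just a sketch of the same trick, not from the original post:

```julia
using LinearAlgebra

v = collect(1.0:10^6)
x = fill(1.999, 10^6)

full     = dot(v, x)       # sum of elementwise products
factored = sum(v) * x[1]   # legitimate only because every x[j] is the same

full ≈ factored            # true: the factored form reads half as much memory
```

Both `dot` and `sum` vectorize internally, so this gives the same speedup as the hand-written `@simd` loops.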

Below are the timings for the library multiplication, for the qmul(P, x) function (a copy of the library algorithm with @simd added, even though it doesn’t seem to have any effect in this case), and for my_mul(Pt, x), which works on the transposed matrix in order to use the rowvals and nzrange functions, exploiting the structure of P (namely, that each row holds equal values).
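As a sanity check (not in the original post), the row-constant structure of P can be verified directly: the broadcast `.* rand(m)` multiplies each row of the 0/1 mask by a single value, so all stored nonzeros within a row are equal:

```julia
using SparseArrays

n, m = 1000, 1000
P  = sparse((rand(n, m) .> 0.97) .* rand(m))
Pt = sparse(P')          # CSC of the transpose: column i of Pt holds row i of P

rows_constant = all(1:n) do i
    vals = nonzeros(Pt)[nzrange(Pt, i)]
    isempty(vals) || all(==(first(vals)), vals)
end
rows_constant            # true: every row of P carries one repeated value
```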

PS
I don’t know how the @simd macro works, so I don’t know why it doesn’t have the same effect in qmul() as it does in my_mul().
Maybe I didn’t use it properly.

julia> using SparseArrays, BenchmarkTools
       
       n,m = 1000,1000
       x = rand(m);
       P = sparse((rand(n,m).>0.97) .* rand(m));
       Pt=sparse(P');

julia> function qmul(Q,y)
         # rv = Q.rowval
         # nzv = Q.nzval
         rv = rowvals(Q)
         nzv = nonzeros(Q)
         c = Vector{Float64}(undef, size(Q,1))
         fill!(c, 0.)
         @simd for col in 1:size(Q, 2)
           for j in nzrange(Q, col)
             @inbounds c[rv[j]] += nzv[j]*y[col]
           end
         end
         c
       end
qmul (generic function with 1 method)

julia> function my_mul(Qt,y)
         rv = rowvals(Qt)
         nzv = nonzeros(Qt)
         c = Vector{Float64}(undef, size(Qt,2))
         fill!(c, 0.)
         @simd for col in 1:size(Qt, 2)
           rj = nzrange(Qt, col)
           for j in rj
             @inbounds c[col] += y[rv[j]]
           end
           c[col] *= nzv[rj.stop]
         end
         c
       end
my_mul (generic function with 1 method)

julia> @btime P*x;
  18.800 μs (1 allocation: 7.94 KiB)

julia> @btime my_mul(Pt,x);
  15.800 μs (1 allocation: 7.94 KiB)

julia> @btime qmul(P,x);
  18.900 μs (1 allocation: 7.94 KiB)

julia> P*x ≈ my_mul(Pt,x)
true

julia> P*x ≈ qmul(P,x)
true

If I define x locally … there is still a difference of about 400 μs:

julia> @btime let
         c=0.
         x=similar($v);
         fill!(x,1.999);
         @simd for j in 1:10^6
           @inbounds c += $v[j]*x[j]
         end
         c
       end
  1.949 ms (2 allocations: 7.63 MiB)
9.995009995e11

julia> @btime begin
         c=0.
         x=similar($v);
         fill!(x,1.999);
         @simd for j in 1:10^6
           @inbounds c += $v[j]
         end
         c *= $x[1]
       end
  1.404 ms (2 allocations: 7.63 MiB)
9.995009995e11

Sure, but then you’re timing the allocation.
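One way to keep the allocation out of the measurement is BenchmarkTools’ `setup` keyword, which runs before each sample and is excluded from the timing; a sketch:

```julia
using BenchmarkTools

v = collect(1.0:10^6)

# x is created in `setup`, so its allocation is not part of the measurement;
# only the reduction loop itself is timed.
@btime let c = 0.
    @simd for j in 1:10^6
        @inbounds c += $v[j] * x[j]
    end
    c
end setup = (x = fill(1.999, 10^6))
```

Note that variables coming from `setup` are referenced without `$`-interpolation inside the benchmarked expression.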

Instead of using the whole matrix, you can just save two vectors: one, “A”, with the non-zero positions, and another, “B”, with the values.
Then, instead of multiplying a matrix by your vector, you can multiply “B” elementwise by the elements of “x” given by the positions in “A”.
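A minimal sketch of that idea for this particular P (the names `RowConstSparse` and `rc_mul` are mine, not from any package): since every row holds one repeated value, it suffices to store the column positions per row (“A”) and one value per row (“B”):

```julia
using SparseArrays

# Hypothetical compact format: cols[i] = nonzero column positions of row i ("A"),
# w[i] = the single value repeated across row i ("B").
struct RowConstSparse
    cols::Vector{Vector{Int}}
    w::Vector{Float64}
end

function RowConstSparse(P::SparseMatrixCSC)
    Pt = sparse(P')                 # CSC of the transpose: rows of P become contiguous
    rv, nzv = rowvals(Pt), nonzeros(Pt)
    n = size(P, 1)
    cols = [rv[nzrange(Pt, i)] for i in 1:n]
    w = [isempty(nzrange(Pt, i)) ? 0.0 : nzv[first(nzrange(Pt, i))] for i in 1:n]
    RowConstSparse(cols, w)
end

function rc_mul(P::RowConstSparse, x::AbstractVector)
    c = zeros(length(P.w))
    @inbounds for i in eachindex(P.w)
        s = 0.0
        for j in P.cols[i]          # gather the entries of x selected by "A"
            s += x[j]
        end
        c[i] = s * P.w[i]           # one multiplication per row instead of one per nonzero
    end
    c
end
```

On a random P with this structure, `rc_mul(RowConstSparse(P), x) ≈ P * x`, while doing only additions in the inner loop.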