I think there is a recent package for threaded multiplication of sparse matrices.
By isolating the part of the code that should exploit the particular structure of the matrix, we see that the obtainable improvement is not large:
using BenchmarkTools
v = collect(1.0:10^6)
x = similar(v)
fill!(x, 1.999)
julia> @btime begin
           c = 0.0
           for j in 1:10^6
               c += $v[j] * $x[j]
           end
           c
       end
942.800 μs (0 allocations: 0 bytes)
9.995009995e11
julia> @btime begin
           c = 0.0
           for j in 1:10^6
               c += $v[j]
           end
           c *= $x[1]
       end
866.600 μs (0 allocations: 0 bytes)
9.995009995e11
Not necessarily:
julia> @btime let c = 0.0
           @simd for j in 1:10^6
               @inbounds c += $v[j] * $x[j]
           end
           c
       end
639.450 μs (0 allocations: 0 bytes)
9.995009995e11
julia> @btime let c = 0.0
           @simd for j in 1:10^6
               @inbounds c += $x[j]
           end
           c *= $x[1]
       end
211.213 μs (0 allocations: 0 bytes)
3.9960009999980503e6

(Note: this second loop accumulates $x[j] rather than $v[j], which is why the printed result differs from the runs above.)
Below are the measurements for the library multiplication, for the qmul(P, x) function (a copy of the library algorithm with @simd added, even if it doesn’t seem to have any effect in this case), and for my_mul(Pt, x), which works on the transpose of the matrix in order to use the rowvals and nzrange functions, exploiting the structure of P (namely that all nonzeros within a given row are equal).
PS
I don’t know how the @simd macro works, so I don’t know why in the case of the qmul() function it doesn’t have the same effect as it does with my_mul(). Maybe I didn’t use it properly.
julia> using SparseArrays, BenchmarkTools

n,m = 1000,1000
x = rand(m);
P = sparse((rand(n,m) .> 0.97) .* rand(m));  # the length-m vector broadcasts along dimension 1 (n == m here), so all nonzeros in a given row of P share one value
Pt = sparse(P');
julia> function qmul(Q, y)
           rv = rowvals(Q)
           nzv = nonzeros(Q)
           c = Vector{Float64}(undef, size(Q, 1))
           fill!(c, 0.0)
           @simd for col in 1:size(Q, 2)   # note: @simd is attached to the outer column loop
               for j in nzrange(Q, col)
                   @inbounds c[rv[j]] += nzv[j] * y[col]
               end
           end
           c
       end
qmul (generic function with 1 method)
julia> function my_mul(Qt, y)
           rv = rowvals(Qt)
           nzv = nonzeros(Qt)
           c = Vector{Float64}(undef, size(Qt, 2))
           fill!(c, 0.0)
           @simd for col in 1:size(Qt, 2)
               rj = nzrange(Qt, col)
               for j in rj
                   @inbounds c[col] += y[rv[j]]   # sum the entries of y selected by this column of Qt (row of P)
               end
               if !isempty(rj)                    # guard against empty rows
                   c[col] *= nzv[rj.stop]         # multiply once by the row's common nonzero value
               end
           end
           c
       end
my_mul (generic function with 1 method)
julia> @btime P*x;
18.800 μs (1 allocation: 7.94 KiB)
julia> @btime my_mul(Pt,x);
15.800 μs (1 allocation: 7.94 KiB)
julia> @btime qmul(P,x);
18.900 μs (1 allocation: 7.94 KiB)
julia> P*x ≈ my_mul(Pt,x)
true
julia> P*x ≈ qmul(P,x)
true
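Regarding the PS above: @simd only annotates the loop it is directly attached to, and in qmul it sits on the outer column loop, whose body is itself another loop rather than straight-line code. A hypothetical variant with the annotation moved to the innermost loop would look like the sketch below (the name qmul_inner is mine; whether it actually vectorizes is uncertain, since the indexed store c[rv[j]] introduces a potential dependency):

using SparseArrays

function qmul_inner(Q, y)
    rv = rowvals(Q)
    nzv = nonzeros(Q)
    c = zeros(size(Q, 1))
    for col in 1:size(Q, 2)
        @simd for j in nzrange(Q, col)   # @simd now on the innermost loop
            @inbounds c[rv[j]] += nzv[j] * y[col]
        end
    end
    c
end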
If I define x locally … there is still about 400 μs of difference:
julia> @btime let
           c = 0.0
           x = similar($v)
           fill!(x, 1.999)
           @simd for j in 1:10^6
               @inbounds c += $v[j] * x[j]
           end
           c
       end
1.949 ms (2 allocations: 7.63 MiB)
9.995009995e11
julia> @btime begin
           c = 0.0
           x = similar($v)
           fill!(x, 1.999)
           @simd for j in 1:10^6
               @inbounds c += $v[j]
           end
           c *= $x[1]
       end
1.404 ms (2 allocations: 7.63 MiB)
9.995009995e11
Sure, but then you’re timing the allocation.
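For completeness, BenchmarkTools can move the allocation into an untimed setup phase via its setup keyword; a minimal sketch (the local name xl is mine, output omitted):

julia> @btime begin
           c = 0.0
           @simd for j in 1:10^6
               @inbounds c += $v[j] * xl[j]
           end
           c
       end setup = (xl = fill(1.999, 10^6))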
Instead of using the whole matrix, you can just save two vectors: one, “A”, with the non-zero positions, and another, “B”, with the values.
Then, instead of multiplying a matrix by your vector, you can multiply “B” elementwise by the elements of your “x” given by the positions in “A”.
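A minimal sketch of one reading of that idea, given the setup above where every row of P carries a single common nonzero value (“A” holds each row’s nonzero positions, “B” that row’s value); the helper names split_structure and ab_mul are mine:

using SparseArrays

# From the transpose Pt, collect per-row position lists ("A") and per-row values ("B").
function split_structure(Qt)
    rv = rowvals(Qt)
    nzv = nonzeros(Qt)
    A = [rv[nzrange(Qt, col)] for col in 1:size(Qt, 2)]  # nonzero positions of each row of P
    B = [(r = nzrange(Qt, col); isempty(r) ? 0.0 : nzv[first(r)]) for col in 1:size(Qt, 2)]  # the row's common value
    return A, B
end

# c[i] = B[i] * (sum of the entries of y picked out by A[i])
ab_mul(A, B, y) = [B[i] * sum(view(y, A[i]); init = 0.0) for i in eachindex(A)]

With A, B = split_structure(Pt), the result of ab_mul(A, B, x) should match P * x up to floating-point roundoff.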