Performance of sparse matrix multiplied with a reshaped view

Hello everybody,

this question is very similar to other performance questions with sparse matrix multiplications where performance hits boil down to julia falling back to unoptimized methods. Still, I would like to ask for confirmation and eventually some hint how to regain performance.

First I would like to establish a baseline timing for the MWE:

julia> using LinearAlgebra, SparseArrays, BenchmarkTools
julia> function mymul(A,x)
         return z
mymul (generic function with 1 method)

julia> A=rand(100,100);
julia> B=sparse(rand(100,100));

julia> vec=rand(100);

julia> @btime Z=mymul($A,$vec);
  1.059 μs (1 allocation: 896 bytes)
julia> @btime Z=mymul($B,$vec);
  6.660 μs (1 allocation: 896 bytes)

Multiplying a sparse matrix with a dense vector is about six to seven times slower than multiplying a dense matrix with the same vector. That appears to be reasonable in this example.

Changing the vector to a view of a vector does not change the timings

julia> vec2=rand(200);
julia> vecview=@view vec2[1:100];

julia> @btime Z=mymul($A,$vecview);
  1.058 μs (1 allocation: 896 bytes)
julia> @btime Z=mymul($B,$vecview);
  6.882 μs (1 allocation: 896 bytes)

which is very nice. I was surprised how well the view works for either case, the overhead of a view is gone in the current version.

However, when using reshape() to make a vector form a view inside an array

julia> ar=rand(102,102);
julia> arview=@view ar[5:14,5:14];
julia> vec3=reshape(arview,:);

julia> @btime Z=mymul($A,$vec3);
  1.894 μs (1 allocation: 896 bytes)
julia> @btime Z=mymul($B,$vec3);
  141.315 μs (1 allocation: 896 bytes)

julia> @which A*vec3
*(A::AbstractArray{T,2}, x::AbstractArray{S,1}) where {T, S} in LinearAlgebra at /home/ckoe/bin/julia-1.5.0/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:49
julia> @which B*vec3
*(A::AbstractArray{T,2}, x::AbstractArray{S,1}) where {T, S} in LinearAlgebra at /home/ckoe/bin/julia-1.5.0/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:49

the performance hit in the dense matrix case is moderate while in the sparse case it is quite big.
I would like to ask some questions about this.

Is the loss of performance in the dense case from the elements in vec3 being no longer adjacent in memory ? Is this a case where for the given types the fallback method for the sparse matrix case is just slow ? This is not obvious to me from the output of @which.

Finally, is there a better way to construct the vector vec3 ? Of course a collect() will remove the performance penalty but that creates a copy which I would like to avoid.

Best Regards

Yes, I think so:

julia> arview isa StridedArray

julia> vec3 isa StridedArray

The dispatch to generic_matvecmul! happens a few steps later, I think here: @which mul!(similar(vec3), A, vec3, true, false) which you can get to with @less a lot, or Dubugger.@enter.

Instead of collect(vec3), perhaps you can re-use some vectors (both for this and for the output) allocated outside the fast bit?

Why B*vec3 is slower I’m not sure.

Thank you for the confirmation. I will have to remember that it is possible to check if something is of a given type (if you now the type). I might try to drill deeper to understand what happens in the sparse case.

Yes, passing a “scratch vector” around would be a pragmatic solution.