# Performance of sparse matrix multiplied with a reshaped view

Hello everybody,

This question is very similar to other performance questions about sparse matrix multiplication, where the performance hit boils down to Julia falling back to an unoptimized generic method. Still, I would like to ask for confirmation, and possibly for a hint on how to regain the lost performance.

First I would like to establish a baseline timing for the MWE:

```julia
julia> using LinearAlgebra, SparseArrays, BenchmarkTools

julia> function mymul(A, x)
           z = A * x
           return z
       end
mymul (generic function with 1 method)

julia> A = rand(100, 100);

julia> B = sparse(rand(100, 100));

julia> vec = rand(100);

julia> @btime Z = mymul($A, $vec);
1.059 μs (1 allocation: 896 bytes)

julia> @btime Z = mymul($B, $vec);
6.660 μs (1 allocation: 896 bytes)
```

Multiplying a sparse matrix with a dense vector is about six to seven times slower than multiplying a dense matrix with the same vector. That appears to be reasonable in this example.
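Note that `B` in this MWE is stored as sparse but is effectively fully dense, since `rand` essentially never produces exact zeros, so the factor mostly reflects the overhead of the CSC storage format:

```julia
julia> nnz(B)   # essentially every entry of rand(100,100) is nonzero
10000

julia> nnz(B) / length(B)
1.0
```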

Changing the vector to a view of a vector does not change the timings,

```julia
julia> vec2 = rand(200);

julia> vecview = @view vec2[1:100];

julia> @btime Z = mymul($A, $vecview);
1.058 μs (1 allocation: 896 bytes)

julia> @btime Z = mymul($B, $vecview);
6.882 μs (1 allocation: 896 bytes)
```

which is very nice. I was surprised how well the view works in either case; the overhead of a view is gone in the current version.
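Presumably this is because a contiguous view is still a `StridedArray`, so it takes the same fast code paths as a plain `Vector`:

```julia
julia> vecview isa StridedArray   # contiguous SubArray of a Vector
true
```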

However, when using `reshape()` to make a vector from a view inside an array,

```julia
julia> ar = rand(102, 102);

julia> arview = @view ar[5:14, 5:14];

julia> vec3 = reshape(arview, :);

julia> @btime Z = mymul($A, $vec3);
1.894 μs (1 allocation: 896 bytes)

julia> @btime Z = mymul($B, $vec3);
141.315 μs (1 allocation: 896 bytes)

julia> @which A*vec3
*(A::AbstractArray{T,2}, x::AbstractArray{S,1}) where {T, S} in LinearAlgebra at /home/ckoe/bin/julia-1.5.0/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:49

julia> @which B*vec3
*(A::AbstractArray{T,2}, x::AbstractArray{S,1}) where {T, S} in LinearAlgebra at /home/ckoe/bin/julia-1.5.0/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:49
```

the performance hit in the dense case is moderate, while in the sparse case it is quite big.

Is the loss of performance in the dense case due to the elements of `vec3` no longer being adjacent in memory? And is this a case where, for the given types, the fallback method used for the sparse matrix is just slow? This is not obvious to me from the output of `@which`.
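For reference, the elements of the view are indeed not adjacent: the view is strided with the column stride of the 102×102 parent, and the reshaped vector cannot be described by a single stride at all.

```julia
julia> strides(arview)   # unit stride within a column, parent's stride across columns
(1, 102)
```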

Finally, is there a better way to construct the vector `vec3`? Of course a `collect()` removes the performance penalty, but it creates a copy, which I would like to avoid.

Best Regards

Yes, I think so:

```julia
julia> arview isa StridedArray
true

julia> vec3 isa StridedArray
false
```

The dispatch to `generic_matvecmul!` happens a few steps later, I think here: `@which mul!(similar(vec3), A, vec3, true, false)`, which you can get to by repeating `@which`/`@less`, or with `Debugger.@enter`.
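A sketch of such a session (`Debugger` here is the Debugger.jl package, which has to be installed separately):

```julia
julia> @which mul!(similar(vec3), A, vec3, true, false)  # the 5-arg mul! behind *

julia> @less mul!(similar(vec3), A, vec3, true, false)   # read the source, then repeat on the next call

julia> using Debugger   # or step through it interactively

julia> Debugger.@enter mymul(A, vec3)
```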

Instead of `collect(vec3)`, perhaps you can re-use some vectors (both for this and for the output) allocated outside the fast bit?
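Something like this, for instance (just a sketch; `mymul!`, `scratch`, and `out` are made-up names, and it assumes the sizes stay fixed):

```julia
using LinearAlgebra, SparseArrays

scratch = Vector{Float64}(undef, 100)   # reusable contiguous input buffer
out     = Vector{Float64}(undef, 100)   # reusable output buffer

function mymul!(out, A, scratch, x)
    copyto!(scratch, x)    # one pass over the slow reshaped view, no allocation
    mul!(out, A, scratch)  # dispatches to the fast dense/sparse kernels again
    return out
end

mymul!(out, B, scratch, vec3)
```

With the buffers allocated once outside, both the dense and the sparse multiplication should be back in the microsecond range.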

Why `B*vec3` is slower I’m not sure.

Thank you for the confirmation. I will have to remember that it is possible to check whether something is of a given type (if you know the type). I might try to drill deeper to understand what happens in the sparse case.

Yes, passing a “scratch vector” around would be a pragmatic solution.