Hello everybody,
this question is very similar to other performance questions about sparse matrix multiplication, where the performance hit boils down to Julia falling back to unoptimized methods. Still, I would like to ask for confirmation, and possibly for a hint on how to regain the lost performance.
First, I would like to establish a baseline timing for the MWE:

```julia
julia> using LinearAlgebra, SparseArrays, BenchmarkTools

julia> function mymul(A, x)
           z = A*x
           return z
       end
mymul (generic function with 1 method)

julia> A = rand(100,100);

julia> B = sparse(rand(100,100));

julia> vec = rand(100);

julia> @btime Z = mymul($A, $vec);
  1.059 μs (1 allocation: 896 bytes)

julia> @btime Z = mymul($B, $vec);
  6.660 μs (1 allocation: 896 bytes)
```
Multiplying a sparse matrix with a dense vector is about six to seven times slower than multiplying a dense matrix with the same vector, which appears reasonable in this example: the dense product goes straight to BLAS, while the sparse kernel has to chase row indices.
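(Incidentally, `rand` essentially never produces exact zeros, so every one of the 10,000 entries of B is actually stored, and the CSC format is pure bookkeeping overhead in this MWE:)

```julia
julia> nnz(B)  # number of stored entries equals length(B): B is fully dense in CSC form
10000
```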
Changing the vector to a view of a vector does not change the timings:

```julia
julia> vec2 = rand(200);

julia> vecview = @view vec2[1:100];

julia> @btime Z = mymul($A, $vecview);
  1.058 μs (1 allocation: 896 bytes)

julia> @btime Z = mymul($B, $vecview);
  6.882 μs (1 allocation: 896 bytes)
```
which is very nice. I was surprised by how well the view works in either case; the overhead of a view is gone in the current version.
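I assume this is because a contiguous view of a `Vector` is itself a strided vector, so both the BLAS and the sparse kernels can accept it directly (a quick check from my Julia 1.5 session):

```julia
julia> vecview isa StridedVector  # contiguous SubArray, eligible for the fast methods
true
```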
However, when using `reshape()` to make a vector from a view inside an array:
```julia
julia> ar = rand(102,102);

julia> arview = @view ar[5:14,5:14];

julia> vec3 = reshape(arview, :);

julia> @btime Z = mymul($A, $vec3);
  1.894 μs (1 allocation: 896 bytes)

julia> @btime Z = mymul($B, $vec3);
  141.315 μs (1 allocation: 896 bytes)
```
```julia
julia> @which A*vec3
*(A::AbstractArray{T,2}, x::AbstractArray{S,1}) where {T, S} in LinearAlgebra at /home/ckoe/bin/julia-1.5.0/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:49

julia> @which B*vec3
*(A::AbstractArray{T,2}, x::AbstractArray{S,1}) where {T, S} in LinearAlgebra at /home/ckoe/bin/julia-1.5.0/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:49
```
the performance hit in the dense case is moderate (roughly 1.8x), while in the sparse case it is quite big (roughly 20x).
I would like to ask some questions about this.
Is the loss of performance in the dense case caused by the elements of `vec3` no longer being adjacent in memory? And is the sparse case simply one where, for the given types, a slow fallback method is selected? This is not obvious to me from the output of `@which`, which shows the same generic entry point for both products.
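What I did notice is that the reshaped view is no longer a strided vector, and my understanding is that the optimized sparse kernel (reached via `mul!`) only accepts strided arguments, so dispatch falls through to the generic method:

```julia
julia> vec3 isa Base.ReshapedArray  # a lazy reshape wrapper around the SubArray
true

julia> vec3 isa StridedVector  # the non-contiguous parent view disqualifies it
false
```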
Finally, is there a better way to construct the vector `vec3`? Of course a `collect()` removes the performance penalty, but it creates a copy, which I would like to avoid.
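The best workaround I can come up with is to preallocate a buffer and an output vector once and reuse them, so the copy at least does not allocate on every call; something along these lines (a sketch, the names are my own):

```julia
buf = Vector{Float64}(undef, length(vec3))  # reusable contiguous copy of the data
out = Vector{Float64}(undef, size(B, 1))    # preallocated result vector

function mymul!(out, A, buf, x)
    copyto!(buf, x)    # gather the view into contiguous memory, no allocation
    mul!(out, A, buf)  # now dispatches to the fast strided kernel
    return out
end
```

But that still copies the data, just without the allocation, so I hope there is something better.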
Best Regards