Hello everybody,

This question is very similar to other performance questions about sparse matrix multiplication, where the performance hit boils down to Julia falling back to an unoptimized method. Still, I would like to ask for confirmation and possibly some hints on how to regain performance.

First I would like to establish a baseline timing for the MWE:

```
julia> using LinearAlgebra, SparseArrays, BenchmarkTools
julia> function mymul(A, x)
           z = A*x
           return z
       end
mymul (generic function with 1 method)
julia> A=rand(100,100);
julia> B=sparse(rand(100,100));
julia> vec=rand(100);
julia> @btime Z=mymul($A,$vec);
1.059 μs (1 allocation: 896 bytes)
julia> @btime Z=mymul($B,$vec);
6.660 μs (1 allocation: 896 bytes)
```

Multiplying a sparse matrix by a dense vector is about six to seven times slower than multiplying a dense matrix by the same vector. That appears to be reasonable in this example.
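As an aside: the single 896-byte allocation in both timings is just the result vector. If that matters in a hot loop, the output can be preallocated and filled in place with `mul!`, which works for both the dense and the sparse case — a minimal sketch (variable names are mine):

```
using LinearAlgebra, SparseArrays

A = rand(100, 100)
B = sparse(rand(100, 100))
x = rand(100)
z = similar(x)     # preallocated output, reused across calls

mul!(z, A, x)      # dense matvec, writes into z, no result allocation
mul!(z, B, x)      # sparse matvec, same in-place API
```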

Changing the vector to a view of a vector does not change the timings:

```
julia> vec2=rand(200);
julia> vecview=@view vec2[1:100];
julia> @btime Z=mymul($A,$vecview);
1.058 μs (1 allocation: 896 bytes)
julia> @btime Z=mymul($B,$vecview);
6.882 μs (1 allocation: 896 bytes)
```

which is very nice. I was surprised by how well the view works in either case; the overhead of the view is gone in the current version.
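If I understand correctly, the reason the view is free here is that a view over a contiguous unit range is still a `StridedVector`, so both the dense (BLAS) and the specialized sparse methods still apply. This can be checked directly — a small sketch:

```
vec2 = rand(200)
vecview = @view vec2[1:100]

# A contiguous view of a Vector is still strided,
# so dispatch reaches the fast matvec methods.
vecview isa StridedVector
```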

However, when using `reshape()` to make a vector from a view inside an array

```
julia> ar=rand(102,102);
julia> arview=@view ar[5:14,5:14];
julia> vec3=reshape(arview,:);
julia> @btime Z=mymul($A,$vec3);
1.894 μs (1 allocation: 896 bytes)
julia> @btime Z=mymul($B,$vec3);
141.315 μs (1 allocation: 896 bytes)
julia> @which A*vec3
*(A::AbstractArray{T,2}, x::AbstractArray{S,1}) where {T, S} in LinearAlgebra at /home/ckoe/bin/julia-1.5.0/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:49
julia> @which B*vec3
*(A::AbstractArray{T,2}, x::AbstractArray{S,1}) where {T, S} in LinearAlgebra at /home/ckoe/bin/julia-1.5.0/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:49
```

the performance hit in the dense matrix case is moderate, while in the sparse case it is quite large.
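My guess at the mechanism (an assumption, not verified in the sources): the reshape of a non-contiguous view is a `Base.ReshapedArray` wrapping a `SubArray`, and since the underlying memory is not contiguous it is not a `StridedVector`. Dispatch therefore misses the specialized sparse matvec and lands on the generic `AbstractArray` fallback seen in the `@which` output, which for a sparse matrix means slow element-wise `getindex`. The type can be inspected like this:

```
ar = rand(102, 102)
arview = @view ar[5:14, 5:14]   # non-contiguous 10x10 window
vec3 = reshape(arview, :)

typeof(vec3)             # Base.ReshapedArray{...} wrapping a SubArray
vec3 isa StridedVector   # false: not strided, so the fast methods don't apply
```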

I would like to ask some questions about this.

Is the loss of performance in the dense case caused by the elements of `vec3` no longer being adjacent in memory? Is this a case where, for the given types, the fallback method in the sparse matrix case is just slow? This is not obvious to me from the output of `@which`.

Finally, is there a better way to construct the vector `vec3`? Of course, a `collect()` removes the performance penalty, but it creates a copy, which I would like to avoid.
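One possible compromise I can think of, if a copy per call is acceptable but a fresh allocation per call is not: gather the view into a preallocated contiguous buffer with `copyto!` before the multiplication. This is a sketch under my assumptions (buffer names are mine), not a true zero-copy solution:

```
using LinearAlgebra, SparseArrays

B = sparse(rand(100, 100))
ar = rand(102, 102)
arview = @view ar[5:14, 5:14]

buf = Vector{Float64}(undef, length(arview))  # allocated once, reused
z = Vector{Float64}(undef, size(B, 1))        # preallocated output

copyto!(buf, arview)   # cheap gather into contiguous memory
mul!(z, B, buf)        # buf is a StridedVector, so the fast sparse method applies
```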

Best Regards