The functions executed for `A*y`

are different for `A`

either real or complex.The first code uses BLAS `gemv!`

(checked via `@code_native`

)

```
A = rand(ComplexF64, n, n)
y = rand(Float64, n)
c = Vector{Complex64}(undef, n)
mul!(c,A,y)
n=20000
julia> @btime mul!($c, $A, $y)
254.250 ms (0 allocations: 0 bytes)
```

while

```
A = rand(Float64, n, n)
y = rand(ComplexF64, n)
c = Vector{Complex64}(undef, n)
mul!(c,A,y)
n=20000
julia> @btime mul!($c, $A, $y)
271.984 ms (0 allocations: 0 bytes)
julia> BLAS.set_num_threads(1)
julia> @btime mul!($c, $A, $y)
276.834 ms (0 allocations: 0 bytes)
```

uses `generic_matvecmul!`

. The performance for `n=20000`

and 4 threads, is still similar. But BLAS is multithreaded, while `generic_matvecmul!`

isnt. I can not check how both cases scale for larger `n`

and more threads right now.

I am interested in the latter case and wonder why it is not using BLAS and is not multithreaded? I thought `generic_`

functions are fallbacks in Julia and one should look for specialized functions when one is interested in performance.

Could someone give some insights in this?