Matrix-vector multiplication: complex/real

Different functions are executed for A*y depending on whether A is real or complex. The first case (complex A, real y) uses BLAS gemv! (checked via @code_native):

using LinearAlgebra, BenchmarkTools

n = 20000
A = rand(ComplexF64, n, n)
y = rand(Float64, n)
c = Vector{ComplexF64}(undef, n)  # ComplexF64, not the old Complex64 alias

mul!(c, A, y)

julia> @btime mul!($c, $A, $y)
  254.250 ms (0 allocations: 0 bytes)

while the second case (real A, complex y)

n = 20000
A = rand(Float64, n, n)
y = rand(ComplexF64, n)
c = Vector{ComplexF64}(undef, n)  # ComplexF64, not the old Complex64 alias

mul!(c, A, y)

julia> @btime mul!($c, $A, $y)
  271.984 ms (0 allocations: 0 bytes)

julia> BLAS.set_num_threads(1)

julia> @btime mul!($c, $A, $y)
  276.834 ms (0 allocations: 0 bytes)

uses generic_matvecmul!. The performance for n=20000 and 4 threads is still similar. But BLAS is multithreaded, while generic_matvecmul! isn't. I cannot check how the two cases scale for larger n and more threads right now.

I am interested in the latter case and wonder why it does not use BLAS and is not multithreaded. I thought the generic_ functions are fallbacks in Julia, and that one should look for specialized functions when performance matters.
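For reference, here is a minimal sketch of a workaround I could imagine for the real-A, complex-y case. It views the complex vectors as 2×n real matrices and does a single real matrix product. The helper name real_complex_mul! is hypothetical (it is not part of LinearAlgebra), it assumes Julia 1.6 or later for reinterpret(reshape, ...), and I have not verified that this product is actually dispatched to BLAS gemm on every Julia version, nor benchmarked it.

```julia
using LinearAlgebra

# Hypothetical helper, not part of LinearAlgebra: multiply a real matrix by a
# complex vector by treating the complex data as a 2×n real matrix.
function real_complex_mul!(c::Vector{ComplexF64}, A::Matrix{Float64}, y::Vector{ComplexF64})
    Y = reinterpret(reshape, Float64, y)  # 2×n: row 1 = real parts, row 2 = imaginary parts
    C = reinterpret(reshape, Float64, c)  # 2×n view sharing memory with c
    # (A * Yᵀ)ᵀ = Y * Aᵀ, so real and imaginary parts are handled in one real product.
    mul!(C, Y, transpose(A))
    return c
end

# Small n here, only to check correctness against the built-in product.
n = 200
A = rand(Float64, n, n)
y = rand(ComplexF64, n)
c = Vector{ComplexF64}(undef, n)
real_complex_mul!(c, A, y)
```

Comparing c against A*y (for example with ≈) should confirm that the result matches the generic path. Whether it is faster would have to be measured with @btime at the sizes above.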

Could someone give some insight into this?