The functions executed for `A*y` are different depending on whether `A` is real or complex. The first code uses BLAS `gemv!` (checked via `@code_native`):

```julia
n = 20000
A = rand(ComplexF64, n, n)
y = rand(Float64, n)
c = Vector{ComplexF64}(undef, n)
mul!(c, A, y)
```

```julia
julia> @btime mul!($c, $A, $y)
  254.250 ms (0 allocations: 0 bytes)
```
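(For context, and if I understand the LinearAlgebra internals correctly: a complex-matrix times real-vector product can be forwarded to real `dgemv` by reinterpreting the complex storage. A rough sketch of that trick, under my own assumptions:)

```julia
using LinearAlgebra

# Sketch of why the complex × real case can hit real BLAS: reinterpret
# the ComplexF64 matrix as a 2n×n Float64 matrix (interleaved
# real/imaginary rows) and the output likewise; the product is then a
# plain real matvec.
n = 4
A = rand(ComplexF64, n, n)
y = rand(Float64, n)
c = Vector{ComplexF64}(undef, n)

Ar = reinterpret(Float64, A)    # 2n×n real view of A
cr = reinterpret(Float64, c)    # length-2n real view of c
mul!(cr, Ar, y)                 # real matvec fills c through the view
c ≈ A * y                       # true
```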
The second code, however, uses `generic_matvecmul!`:

```julia
n = 20000
A = rand(Float64, n, n)
y = rand(ComplexF64, n)
c = Vector{ComplexF64}(undef, n)
mul!(c, A, y)
```

```julia
julia> @btime mul!($c, $A, $y)
  271.984 ms (0 allocations: 0 bytes)

julia> BLAS.set_num_threads(1)

julia> @btime mul!($c, $A, $y)
  276.834 ms (0 allocations: 0 bytes)
```

The performance of the two cases for n = 20000 and 4 threads is still similar. But BLAS is multithreaded, while `generic_matvecmul!` isn't. I cannot check right now how the two cases scale for larger n and more threads.
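Just to illustrate what I mean by multithreading the fallback: a hand-rolled threaded matvec is easy to sketch. This is my own sketch, not how `generic_matvecmul!` is actually implemented, and it assumes Julia was started with e.g. `-t 4`:

```julia
using Base.Threads

# Hand-rolled threaded matvec: disjoint row blocks per thread, columns
# in the inner loop to respect Julia's column-major layout.
function threaded_matvec!(c, A, y)
    m, n = size(A)
    fill!(c, zero(eltype(c)))
    blocks = collect(Iterators.partition(1:m, cld(m, nthreads())))
    @threads for rows in blocks
        @inbounds for j in 1:n
            yj = y[j]
            @simd for i in rows
                c[i] += A[i, j] * yj
            end
        end
    end
    return c
end

# Drop-in for mul!(c, A, y):
# threaded_matvec!(c, A, y)
```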
I am interested in the latter case and wonder why it does not use BLAS and is not multithreaded. I thought `generic_*` functions are fallbacks in Julia, and that one should look for specialized functions when performance matters. Could someone give some insight into this?
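For completeness, the one workaround I can think of (a sketch with made-up names, assuming `A` is real and `y` is complex as above) is to pack the complex vector into an n×2 real matrix, so the product goes through multithreaded BLAS `dgemm`:

```julia
using LinearAlgebra

# Pack real/imaginary parts of y into the columns of a real matrix so
# that the product dispatches to BLAS gemm instead of the generic
# fallback, then reassemble the complex result.
function real_times_complex!(c, A, y)
    n = length(y)
    Y = Matrix{Float64}(undef, n, 2)
    Y[:, 1] .= real.(y)
    Y[:, 2] .= imag.(y)
    C = A * Y                              # BLAS dgemm on real data
    c .= complex.(view(C, :, 1), view(C, :, 2))
    return c
end
```

The two extra copies are O(n), which should be negligible next to the O(n²) product, though this sketch does allocate the temporaries `Y` and `C`.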