Perhaps the following puzzle will ring a bell for someone?
Multiply mul!(H, B, Gamma)
:
(eltype(H), eltype(B), eltype(Gamma)) = (Complex{Float64}, Complex{Float64}, Complex{Float64})
It takes 13.091117 seconds
Multiply mul!(Vn, Gamma, P)
:
(eltype(Vn), eltype(Gamma), eltype(P), size(Gamma), size(P)) = (Complex{Float64}, Complex{Float64}, Complex{Float64}, (4436, 4436), (4436, 181))
It takes 0.643775 seconds
Now we change Gamma
to be real, everything else is the same.
(eltype(H), eltype(B), eltype(Gamma)) = (Complex{Float64}, Complex{Float64}, Float64)
The timing for the first product decreased with more than a factor of two (4.532288 seconds).
The timing for the second product increased by about a factor of 8:
(eltype(Vn), eltype(Gamma), eltype(P), size(Gamma), size(P)) = (Complex{Float64}, Float64, Complex{Float64}, (4436, 4436), (4436, 181))
This takes 4.958194 seconds (6 allocations: 336 bytes)
I could track it down to _generic_matmatmul!
. I believe tiled execution is selected. (There are some scary # FIXME: This code is completely invalid!!!
warnings around!) But that’s where I got stuck. I don’t see why changing the type of one of the arrays should speed up one product and slow down another one.