 # Possible slow fallback of adjoint matrix multiplication for different element types

I was running some matrix diagonalization and processing the eigenvectors. I notice that the script takes an unusually long time and I realize that it could be a problem of the adjoint matrix multiplication.

Here is the testing script:

``````using LinearAlgebra, BenchmarkTools, Cthulhu

function adjoint_mul1(dim::Int)
A = rand(ComplexF64, dim, dim)
B = rand(Int, dim, dim)
A' * B
end

function adjoint_mul2(dim::Int)
A = rand(ComplexF64, dim, dim)
B = rand(ComplexF64, dim, dim)
A' * B
end

@btime a = adjoint_mul1(2^11)
@btime b = adjoint_mul2(2^11)
``````

The result is as follows:

``````  12.410 s (6 allocations: 160.00 MiB)
191.027 ms (6 allocations: 192.00 MiB)
``````

When the matrix size grows even larger, say (4096,4096), basically `adjoint_mul1()` will take forever (well actually 100+ seconds for 4096). By checking the function calltree, it appears that `adjoint_mul1()` eventually uses `generic_matmatmul!`

``````# Line 455 in https://github.com/JuliaLang/julia/blob/master/stdlib/LinearAlgebra/src/matmul.jl
@inline function mul!(C::AbstractMatrix, adjA::Adjoint{<:Any,<:AbstractVecOrMat}, B::AbstractVecOrMat,
alpha::Number, beta::Number)
A = adjA.parent
return generic_matmatmul!(C, 'C', 'N', A, B, MulAddMul(alpha, beta))
end
``````

and `adjoint_mul2()` eventually uses `blas` wrapper

``````# Line 446 in https://github.com/JuliaLang/julia/blob/master/stdlib/LinearAlgebra/src/matmul.jl
@inline function mul!(C::StridedMatrix{T}, adjA::Adjoint{<:Any,<:StridedVecOrMat{T}}, B::StridedVecOrMat{T},
alpha::Number, beta::Number) where {T<:BlasComplex}
A = adjA.parent
if A===B
return herk_wrapper!(C, 'C', A, MulAddMul(alpha, beta))
else
return gemm_wrapper!(C, 'C', 'N', A, B, MulAddMul(alpha, beta))
end
end
``````

I wonder if this is by purpose or it should be opened as an issue for better performance? I understand that this is a problem of dispatching to BLAS when different types are used.

For me, I feel that it seems a type reinterpretation is missing such that when different types are used, a slow fallback is called. And my personal feeling is that code like `A' * B` should take care of type promotion (not exactly what should be done there, perhaps more of reinterpretation) itself for generic users.