Possible slow fallback of adjoint matrix multiplication for different element types

I was diagonalizing some matrices and processing the eigenvectors. I noticed that the script took an unusually long time, and I suspect the cause is the adjoint matrix multiplication.

Here is the testing script:

using LinearAlgebra, BenchmarkTools, Cthulhu

function adjoint_mul1(dim::Int)
  A = rand(ComplexF64, dim, dim)
  B = rand(Int, dim, dim)
  A' * B
end

function adjoint_mul2(dim::Int)
  A = rand(ComplexF64, dim, dim)
  B = rand(ComplexF64, dim, dim)
  A' * B
end

@btime a = adjoint_mul1(2^11)
@btime b = adjoint_mul2(2^11)

The result is as follows:

  12.410 s (6 allocations: 160.00 MiB)
  191.027 ms (6 allocations: 192.00 MiB)

For larger matrices, say 4096×4096, adjoint_mul1() becomes impractically slow (over 100 seconds). Checking the call tree shows that adjoint_mul1() eventually reaches generic_matmatmul!:

# Line 455 in https://github.com/JuliaLang/julia/blob/master/stdlib/LinearAlgebra/src/matmul.jl
@inline function mul!(C::AbstractMatrix, adjA::Adjoint{<:Any,<:AbstractVecOrMat}, B::AbstractVecOrMat,
                 alpha::Number, beta::Number)
    A = adjA.parent
    return generic_matmatmul!(C, 'C', 'N', A, B, MulAddMul(alpha, beta))
end

whereas adjoint_mul2() eventually reaches the BLAS wrapper:

# Line 446 in https://github.com/JuliaLang/julia/blob/master/stdlib/LinearAlgebra/src/matmul.jl
@inline function mul!(C::StridedMatrix{T}, adjA::Adjoint{<:Any,<:StridedVecOrMat{T}}, B::StridedVecOrMat{T},
                 alpha::Number, beta::Number) where {T<:BlasComplex}
    A = adjA.parent
    if A===B
        return herk_wrapper!(C, 'C', A, MulAddMul(alpha, beta))
    else
        return gemm_wrapper!(C, 'C', 'N', A, B, MulAddMul(alpha, beta))
    end
end

I wonder whether this is intentional, or whether I should open an issue about the performance. I understand that the underlying problem is dispatch: the BLAS method only applies when both matrices have the same element type.

It seems to me that a type conversion step is missing, so the slow fallback is called whenever the element types differ. My personal feeling is that code like A' * B should take care of the promotion itself for generic users (I am not sure exactly what should be done there, perhaps something closer to a reinterpretation).
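In the meantime, a manual workaround might look like the sketch below. It assumes that converting both operands to a common element type up front is acceptable for the use case; promoted_adjoint_mul is a hypothetical helper name, not something provided by LinearAlgebra. I have not benchmarked it here, so treat it as an illustration of the idea rather than a measured fix.

```julia
using LinearAlgebra

# Hypothetical helper (not part of LinearAlgebra): convert both operands to
# their promoted element type before multiplying, so that a method requiring
# matching element types (such as the BLAS-backed one) can be selected.
function promoted_adjoint_mul(A::AbstractMatrix, B::AbstractMatrix)
    T = promote_type(eltype(A), eltype(B))   # ComplexF64 for ComplexF64 and Int
    A2 = convert(AbstractMatrix{T}, A)       # returns A itself if eltype already is T
    B2 = convert(AbstractMatrix{T}, B)       # allocates a converted copy otherwise
    return A2' * B2
end

A = rand(ComplexF64, 64, 64)
B = rand(-5:5, 64, 64)                       # Matrix{Int} with small entries

C_generic  = A' * B                          # mixed element types
C_promoted = promoted_adjoint_mul(A, B)      # both operands ComplexF64
```

The cost is one extra copy of the converted matrix. For a pair like Int and Int the promoted type is still Int, so this sketch would not help in that case.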