using BenchmarkTools

const p = 60
const dd = randn(p, p)
const ψ = randn(p, p*p) + im*randn(p, p*p)

function d2x(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
    ψr = dd*ψ
    return ψr
end

@btime d2x(ψ);
The result is
1.330 ms (4 allocations: 3.35 MiB)
The in-place version:
using LinearAlgebra  # for mul!

function d2x(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
    ψr = similar(ψ)
    mul!(ψr, dd, ψ)
    return ψr
end
@btime d2x(ψ);
julia> function d2x_2(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
           ψr = similar(ψ)
           mul!(ψr, convert(AbstractArray{ComplexF64}, dd), convert(AbstractArray{ComplexF64}, ψ))
           return ψr
       end
The reason? Well, I couldn’t believe * doesn’t call mul!(), so I looked at what happens during dd*ψ, and this is what the source code looks like:
# optimization for dispatching to BLAS, e.g. *(::Matrix{Float32}, ::Matrix{Float64})
# but avoiding the case *(::Matrix{<:BlasComplex}, ::Matrix{<:BlasReal})
# which is better handled by reinterpreting rather than promotion
function (*)(A::StridedMatrix{<:BlasReal}, B::StridedMatrix{<:BlasFloat})
    TS = promote_type(eltype(A), eltype(B))
    mul!(similar(B, TS, (size(A,1), size(B,2))), convert(AbstractArray{TS}, A), convert(AbstractArray{TS}, B))
end
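To make the comment above concrete: for the complex-left, real-right case it mentions (the opposite order from dd*ψ), reinterpreting views the complex matrix as a real matrix with interleaved real/imaginary rows, so the whole product can be done with a real-only multiply and no promotion of the real operand. A minimal sketch of that identity (my example, not code from this thread):

A = randn(ComplexF64, 4, 5)   # complex on the left
B = randn(5, 3)               # real on the right
# reinterpret(Float64, A) is an 8×5 real view whose rows alternate
# real and imaginary parts; multiply it by B and view the 8×3 result
# back as a 4×3 complex matrix.
C = reinterpret(ComplexF64, reinterpret(Float64, A) * B)
C ≈ A * B                     # should hold up to floating-point rounding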
Not really; just use *, and if you need mul!, do the same hack * does. At some point you need some hacks for BLAS; in this case it happened in mul!. It doesn’t make things much better if mul!() wraps some impl_mul!() or __mul!()…
It’s unfortunate that this happens in mul!(), which is a public API.
Also, you’re definitely over-typing: those annotations are not needed and don’t give you any performance benefit. As long as you can figure out the return type of your code, the compiler should be able to as well. (And if you ever want to find out whether the compiler can figure it out, use @code_warntype.)
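For example, a quick check along these lines (the untyped definition here is just for illustration, not from the thread):

julia> d2x_untyped(dd, ψ) = dd * ψ   # no type annotations at all
d2x_untyped (generic function with 1 method)

julia> @code_warntype d2x_untyped(dd, ψ)   # output omitted; the inferred return
                                           # type should be Matrix{ComplexF64}, with no red Any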
Mixing types is a great use case for pure-Julia implementations:
julia> using LinearAlgebra, Octavian, BenchmarkTools

julia> function d2x_0(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
           ψr = dd*ψ
           return ψr
       end
d2x_0 (generic function with 1 method)

julia> function d2x_1(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
           ψr = similar(ψ)
           mul!(ψr, dd, ψ)
           return ψr
       end
d2x_1 (generic function with 1 method)

julia> function d2x_2(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
           ψr = similar(ψ)
           mul!(ψr, convert(AbstractArray{ComplexF64}, dd), convert(AbstractArray{ComplexF64}, ψ))
           return ψr
       end
d2x_2 (generic function with 1 method)

julia> function d2x_3(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
           ψr = similar(ψ)
           matmul!(ψr, dd, ψ)
           return ψr
       end
d2x_3 (generic function with 1 method)
julia> @btime d2x_0($ψ);
84.730 μs (4 allocations: 3.35 MiB)
julia> @btime d2x_1($ψ);
9.113 ms (5 allocations: 3.32 MiB)
julia> @btime d2x_2($ψ);
88.578 μs (4 allocations: 3.35 MiB)
julia> @btime d2x_3($ψ);
65.860 μs (2 allocations: 3.30 MiB)
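As a sanity check, all four variants compute the same product dd*ψ, so their results should agree up to floating-point rounding:

julia> d2x_0(ψ) ≈ d2x_1(ψ) ≈ d2x_2(ψ) ≈ d2x_3(ψ)   # expect true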
Octavian’s Complex matmul isn’t really optimized, so it won’t scale as well to large sizes.
If anyone is interested in improving it, let me know and I can describe what’s needed, answer any questions, etc.
Also, for those concerned about Octavian’s time to first matmul: this will hopefully largely solve the problem.
EDIT: We can also improve Octavian’s small-size complex matmul performance for CPUs with the FMA3 instruction set (which is basically any recent x86 CPU).
The problem is fixable, but someone will have to get around to it.
In A*B, the size of matrix A is much more important than the size of B for determining scaling.
If anyone is interested in making the improvements I described in that comment, let me know and I can describe what needs to be done.
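The point about type declarations can be made with a definition that has no annotations at all and takes dd as an argument, something like this (a sketch of the definition the next reply refers to; the exact code isn’t quoted here):

d2x(dd, ψ) = dd * ψ

@btime d2x($dd, $ψ);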
This definition is every bit as fast as your code using * with a lot of type declarations, but it works for any type of arguments supporting * (it is “type generic”). (I also moved dd to be an argument rather than a const global, but that’s for generality and style rather than performance.) The compiler generates specialized versions of d2x for every combination of argument types detected at the call sites. This is the key advantage of Julia: you can write type-generic code that still compiles efficiently.
Regarding your performance issue: the basic problem is that the optimized BLAS libraries don’t provide optimized routines for real * complex matrix multiplication, so a slower fallback method gets called here. (That’s the advantage of a pure-Julia implementation like Octavian, but optimizing matrix multiplies is a lot of work… OpenBLAS has tens of thousands of lines of code just for matrix multiplies.)
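To make the “no real * complex BLAS routine” point concrete, the mixed product can always be rewritten as two real-only BLAS multiplies and then recombined. A rough sketch (my own, not from the thread; it allocates temporaries for the real and imaginary parts, so it may or may not beat promotion in practice):

# dd is real, so Re(dd*ψ) = dd*Re(ψ) and Im(dd*ψ) = dd*Im(ψ);
# each product is a plain real dgemm call.
function d2x_split(dd::Matrix{Float64}, ψ::Matrix{ComplexF64})
    return complex.(dd * real(ψ), dd * imag(ψ))
end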
On the other hand, if your matrices are really just 60x60, optimizing a kernel is in principle much easier, since it is pretty much just about SIMD.
Unfortunately, https://github.com/JuliaLang/julia/pull/44074 doesn’t fix the in-place mul! (the matrix-vector version of mul! is fixed); it only improves Matrix{<:Real} * Matrix{<:Complex}.