In-place multiplication is much slower for ComplexF64

The ordinary version:

using BenchmarkTools

const p = 60
const dd=randn(p,p)
const ψ=randn(p,p*p)+im*randn(p,p*p)

function d2x(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
    ψr=dd*ψ
    return ψr
end

@btime d2x(ψ);

The result is

1.330 ms (4 allocations: 3.35 MiB)

The in-place version:

using LinearAlgebra

function d2x(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
    ψr = similar(ψ)
    mul!(ψr, dd, ψ)  # in-place product: writes dd*ψ into ψr
    return ψr
end

@btime d2x(ψ);

The result is

9.511 ms (8 allocations: 3.30 MiB)

Is it a bug or something else?

Here’s the fix:

julia> function d2x_2(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
           ψr = similar(ψ)
           mul!(ψr,convert(AbstractArray{ComplexF64}, dd),convert(AbstractArray{ComplexF64}, ψ))
           return ψr
       end
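
A quick sanity check that the converted version computes the same product:

julia> d2x_2(ψ) ≈ dd*ψ
true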

The reason? Well, I couldn’t believe that * doesn’t call mul!(), so I looked at what happens during dd*ψ, and this is what the source code looks like:

# optimization for dispatching to BLAS, e.g. *(::Matrix{Float32}, ::Matrix{Float64})
# but avoiding the case *(::Matrix{<:BlasComplex}, ::Matrix{<:BlasReal})
# which is better handled by reinterpreting rather than promotion
function (*)(A::StridedMatrix{<:BlasReal}, B::StridedMatrix{<:BlasFloat})
    TS = promote_type(eltype(A), eltype(B))
    mul!(similar(B, TS, (size(A,1), size(B,2))), convert(AbstractArray{TS}, A), convert(AbstractArray{TS}, B))
end

Link to the source code on GitHub.


Thanks. It converts both matrices to complex ones, but dd is a real matrix; isn’t there an efficiency problem?

Depends on what you mean; this seems to be the best approach given the capabilities of BLAS and how we use it.

Does any other language run significantly faster given the same number of BLAS threads?


I benchmarked your code and found no efficiency problem related to the complex conversion of dd. It still feels a bit strange to me, though.
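
One way to see why: converting ψ is a no-op because it already has the right element type, and converting dd only copies a p×p block, which is tiny next to the multiply itself. A quick check (a sketch; exact timings will vary):

using BenchmarkTools

@btime convert(AbstractArray{ComplexF64}, $dd);  # allocates and copies a 60×60 matrix
@btime convert(AbstractArray{ComplexF64}, $ψ);   # returns ψ itself: no copy at all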

So, it is just a bug, if I understand correctly?

Not really. Just use *; if you need mul!, apply the same hack that * does. At some point you have to make some hacks for BLAS, and in this case it happens in mul!. It wouldn’t make things much better if mul!() wrapped some impl_mul!() or __mul!().
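
For concreteness, here is a minimal sketch of "the same hack" applied to mul! (mul_promoted! is a made-up name, not a LinearAlgebra API):

using LinearAlgebra

# Promote both operands to a common element type so that mul! dispatches
# to the BLAS gemm! kernel instead of the generic fallback.
function mul_promoted!(C, A, B)
    T = promote_type(eltype(A), eltype(B))
    mul!(C, convert(AbstractArray{T}, A), convert(AbstractArray{T}, B))
end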

It’s unfortunate that it happens in mul!(), which is a public API.

Also, you’re definitely over-typing: those annotations are not needed and don’t give you any performance benefit.


Sorry, why is that? I thought it was hard for the Julia compiler to infer the type of ψ, which might make the code slow.

As long as you can figure out the return type of your code, the compiler should be able to as well. (And if you ever want to check whether the compiler can figure it out, use @code_warntype.)
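
For example (a minimal illustration; d2x_plain is just a throwaway name):

julia> d2x_plain(ψ) = dd*ψ;

julia> @code_warntype d2x_plain(ψ)  # look for Body::Matrix{ComplexF64}; no red Any means it is fully inferred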


Mixing types is a great use case for pure-Julia implementations:

julia> using LinearAlgebra, Octavian, BenchmarkTools

julia> function d2x_0(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
           ψr=dd*ψ
           return ψr
       end
d2x_0 (generic function with 1 method)

julia> function d2x_1(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
           ψr = similar(ψ)
           mul!(ψr,dd,ψ)
           return ψr
       end
d2x_1 (generic function with 1 method)

julia> function d2x_2(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
                  ψr = similar(ψ)
                  mul!(ψr,convert(AbstractArray{ComplexF64}, dd),convert(AbstractArray{ComplexF64}, ψ))
                  return ψr
              end
d2x_2 (generic function with 1 method)

julia> function d2x_3(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
                  ψr = similar(ψ)
                  matmul!(ψr, dd, ψ)
                  return ψr
              end
d2x_3 (generic function with 1 method)

julia> @btime d2x_0($ψ);
  84.730 μs (4 allocations: 3.35 MiB)

julia> @btime d2x_1($ψ);
  9.113 ms (5 allocations: 3.32 MiB)

julia> @btime d2x_2($ψ);
  88.578 μs (4 allocations: 3.35 MiB)

julia> @btime d2x_3($ψ);
  65.860 μs (2 allocations: 3.30 MiB)

Octavian’s Complex matmul isn’t really optimized, so it won’t scale as well to large sizes.
If anyone is interested in improving it, let me know and I can describe what’s needed, answer any questions, etc.
Also, for those concerned about Octavian’s time to first matmul: this will hopefully largely solve the problem.

EDIT: We can also improve Octavian’s small-size complex matmul performance for CPUs with the FMA3 instruction set (which is basically any recent x86 CPU).


In this situation Octavian performs pretty well on my machine (i7-4771 with Windows 10):

using Octavian, BenchmarkTools

const p = 60
const dd=randn(p,p)
const ψ=randn(p,p*p)+im*randn(p,p*p)

function d2x(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
    ψr = similar(ψ)
    matmul!(ψr,dd,ψ)
    return ψr
end

@btime d2x(ψ);

The result is

655.100 μs (3 allocations: 3.30 MiB)

compared to 1.330 ms using dd*ψ.


Setting p = 150, Octavian shows a 2.5× speedup compared to dd*ψ. Impressive!


I found that the efficiency of Octavian drops dramatically if ψ is multiplied by ddt from the right-hand side:

using Octavian, BenchmarkTools

const p = 60
const ddt=randn(p,p)
const ψ=randn(p*p,p)+im*randn(p*p,p)

function d2z(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
    ψr = similar(ψ)
    matmul!(ψr,ψ,ddt)
    return ψr
end

@btime d2z(ψ);

with the result

3.986 ms (2 allocations: 3.30 MiB)

much slower than ψ*ddt (1.230 ms). Is that normal?

This is related to:

The problem is fixable, but someone will have to get around to it.

In A*B, the size of matrix A is much more important than the size of B for determining scaling.
If anyone is interested in making the improvements I described in that comment, let me know and I can describe what needs to be done.
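
A quick way to probe this asymmetry yourself (a sketch; the exact numbers depend on your machine and thread count):

using Octavian, BenchmarkTools

A1 = randn(60, 60);               B1 = randn(ComplexF64, 60, 3600)   # small A, wide B
A2 = randn(ComplexF64, 3600, 60); B2 = randn(60, 60)                 # tall A, small B

@btime matmul!(C, $A1, $B1) setup=(C = Matrix{ComplexF64}(undef, 60, 3600));
@btime matmul!(C, $A2, $B2) setup=(C = Matrix{ComplexF64}(undef, 3600, 60));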


The whole point of Julia’s design is to enable type inference for properly written code. See the manual section on argument-type declarations.

d2x(ψ, dd) = dd * ψ

is every bit as fast as your version with all the type declarations, but it works for any argument types that support * (it is “type-generic”). (I also moved dd to be an argument rather than a const global, but that’s for generality and style rather than performance.) The compiler generates a specialized version of d2x for every combination of argument types detected at the call sites. This is the key advantage of Julia: you can write type-generic code that still compiles efficiently.
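
For instance, the same one-line definition handles other element types with no changes:

julia> d2x(ψ, dd);                                      # ComplexF64 times Float64, as before

julia> d2x(randn(Float32, 8, 8), randn(Float32, 8, 8)); # a fresh specialization for Float32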

Regarding your performance issue: the basic problem is that the optimized BLAS libraries don’t provide optimized implementations for real * complex matrix multiplies, so it was calling a slower fallback method. (That’s the advantage of a pure-Julia implementation like Octavian, but optimizing matrix multiplies is a lot of work: OpenBLAS has tens of thousands of lines of code just for matrix multiplies.)

On the other hand, if your matrices really are just 60×60, optimizing a kernel is in principle much easier, since at that size it is pretty much just about SIMD.
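
As a rough illustration of what such a kernel looks like in pure Julia (only a sketch for correctness; a competitive kernel would need blocking and explicit vectorization of the real and imaginary parts):

# Naive real*complex kernel; not tuned for performance.
function realcomplex_mul!(C::Matrix{ComplexF64}, A::Matrix{Float64}, B::Matrix{ComplexF64})
    @inbounds for j in axes(B, 2), i in axes(A, 1)
        acc = zero(ComplexF64)
        for k in axes(A, 2)
            acc += A[i, k] * B[k, j]
        end
        C[i, j] = acc
    end
    return C
end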


Is this a quick (dirty) fix?

using Octavian, BenchmarkTools

const p = 60
const dd=randn(p,p)
const ψ=randn(p*p,p)+im*randn(p*p,p)

function d2z(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
    ψr = similar(ψ, p, p*p)  # dd*transpose(ψ) has size (p, p*p), so the result comes out transposed
    matmul!(ψr,dd,transpose(ψ))
    return ψr
end

@btime d2z(ψ);

with the result

826.200 μs (3 allocations: 3.30 MiB)

https://github.com/JuliaLang/julia/pull/44074/files
https://github.com/JuliaLang/julia/pull/44011/files

These fix both * and mul!() on the master branch.

Unfortunately, https://github.com/JuliaLang/julia/pull/44074 doesn’t fix the in-place mul! (the one for matrix-vector multiplication does). It only improves Matrix{<:Real} * Matrix{<:Complex}.

Oh, not really. The result would still have to be transposed back to (p*p, p) (using permutedims to avoid a lazy transpose), which brings the total cost to more than 2 ms.
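
For reference, a corrected version of that quick fix would look like this (d2z_t is just an illustrative name; the extra permutedims is exactly where the time goes):

function d2z_t(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
    ψr = Matrix{ComplexF64}(undef, p, p*p)  # the product comes out transposed
    matmul!(ψr, dd, transpose(ψ))
    return permutedims(ψr)                  # ψ*transpose(dd), materialized as (p*p, p)
end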

Well, here is the real fix: reinterpret ψ as a real matrix with interleaved real and imaginary parts. Multiplying from the right by a real matrix only forms linear combinations of columns, so the interleaved column layout is preserved:

using Octavian, BenchmarkTools

const p = 60
const ddt=randn(p,p)
const ψ=randn(p*p,p)+im*randn(p*p,p)

function d2z(ψ::Array{ComplexF64,2})
    ψp = reinterpret(Float64, ψ)  # (2*p*p, p) real matrix, interleaved re/im
    ψr = similar(ψp)
    matmul!(ψr, ψp, ddt)          # purely real multiply
    return reinterpret(ComplexF64, ψr)
end

@btime d2z(ψ);

with the result

871.600 μs (2 allocations: 3.30 MiB)
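
The same reinterpret trick should also work with plain mul!, since a reinterpreted Array is still strided and should therefore hit the BLAS dgemm path (a sketch, not benchmarked here; d2z_blas is an illustrative name):

using LinearAlgebra

function d2z_blas(ψ::Array{ComplexF64,2})
    ψp = reinterpret(Float64, ψ)  # (2*p*p, p) real view, interleaved re/im
    ψr = similar(ψp)
    mul!(ψr, ψp, ddt)             # purely real product
    return reinterpret(ComplexF64, ψr)
end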