In-place multiplication is much slower for ComplexF64

The ordinary version:

using BenchmarkTools

const p = 60
const dd=randn(p,p)
const ψ=randn(p,p*p)+im*randn(p,p*p)

function d2x(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
    ψr=dd*ψ
    return ψr
end

@btime d2x(ψ);

The result is

1.330 ms (4 allocations: 3.35 MiB)

The in-place version:

using LinearAlgebra

function d2x(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
    ψr = similar(ψ)
    mul!(ψr, dd, ψ)  # in-place product: writes dd*ψ into ψr
    return ψr
end

@btime d2x(ψ);

The result is

9.511 ms (8 allocations: 3.30 MiB)

Is it a bug or something else?

Here’s the fix:

julia> function d2x_2(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
           ψr = similar(ψ)
           mul!(ψr,convert(AbstractArray{ComplexF64}, dd),convert(AbstractArray{ComplexF64}, ψ))
           return ψr
       end
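
A quick sanity check that the converted version computes the same product:

julia> d2x_2(ψ) ≈ dd*ψ
true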

The reason? Well, I couldn’t believe that * doesn’t call mul!(), so I looked at what happens during dd*ψ, and this is what the source code looks like:

# optimization for dispatching to BLAS, e.g. *(::Matrix{Float32}, ::Matrix{Float64})
# but avoiding the case *(::Matrix{<:BlasComplex}, ::Matrix{<:BlasReal})
# which is better handled by reinterpreting rather than promotion
function (*)(A::StridedMatrix{<:BlasReal}, B::StridedMatrix{<:BlasFloat})
    TS = promote_type(eltype(A), eltype(B))
    mul!(similar(B, TS, (size(A,1), size(B,2))), convert(AbstractArray{TS}, A), convert(AbstractArray{TS}, B))
end

Link to the source code on GitHub.


Thanks. It converts both matrices to complex ones, but dd is a real matrix; isn’t there an efficiency problem?

Depends on what you mean; this seems to be the best approach given the capabilities of BLAS and how we use it.

Does any other language run significantly faster given the same number of BLAS threads?


I benchmarked your code and found no efficiency problem related to the complex conversion of dd. It still feels a bit strange to me, though.
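
One way to see why: converting ψ is a no-op because it already has the right element type, and converting dd only copies a p×p block, which is tiny next to the multiply itself. A quick check (a sketch; exact timings will vary):

using BenchmarkTools

@btime convert(AbstractArray{ComplexF64}, $dd);  # allocates and copies a 60×60 matrix
@btime convert(AbstractArray{ComplexF64}, $ψ);   # returns ψ itself: no copy at all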

So, it is just a bug, if I understand correctly?

Not really. Just use *; if you need mul!, apply the same hack that * does. At some point you have to make some hacks for BLAS, and in this case it happens in mul!. It wouldn’t make things much better if mul!() wrapped some impl_mul!() or __mul!().
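
For concreteness, here is a minimal sketch of "the same hack" applied to mul! (mul_promoted! is a made-up name, not a LinearAlgebra API):

using LinearAlgebra

# Promote both operands to a common element type so that mul! dispatches
# to the BLAS gemm! kernel instead of the generic fallback.
function mul_promoted!(C, A, B)
    T = promote_type(eltype(A), eltype(B))
    mul!(C, convert(AbstractArray{T}, A), convert(AbstractArray{T}, B))
end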

It’s unfortunate that it happens in mul!(), which is a public API.

Also, you’re definitely over-typing: those annotations are not needed and don’t give you any performance benefit.


Sorry, why is that? I thought it was hard for the Julia compiler to infer the type of ψ, which might make the code slow.

As long as you can figure out the return type of your code, the compiler should be able to as well. (And if you ever want to check whether the compiler can figure it out, use @code_warntype.)
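
For example (a minimal illustration; d2x_plain is just a throwaway name):

julia> d2x_plain(ψ) = dd*ψ;

julia> @code_warntype d2x_plain(ψ)  # look for Body::Matrix{ComplexF64}; no red Any means it is fully inferred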


Mixing types is a great use case for pure-Julia implementations:

julia> using LinearAlgebra, Octavian, BenchmarkTools

julia> function d2x_0(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
           ψr=dd*ψ
           return ψr
       end
d2x_0 (generic function with 1 method)

julia> function d2x_1(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
           ψr = similar(ψ)
           mul!(ψr,dd,ψ)
           return ψr
       end
d2x_1 (generic function with 1 method)

julia> function d2x_2(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
                  ψr = similar(ψ)
                  mul!(ψr,convert(AbstractArray{ComplexF64}, dd),convert(AbstractArray{ComplexF64}, ψ))
                  return ψr
              end
d2x_2 (generic function with 1 method)

julia> function d2x_3(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
                  ψr = similar(ψ)
                  matmul!(ψr, dd, ψ)
                  return ψr
              end
d2x_3 (generic function with 1 method)

julia> @btime d2x_0($ψ);
  84.730 μs (4 allocations: 3.35 MiB)

julia> @btime d2x_1($ψ);
  9.113 ms (5 allocations: 3.32 MiB)

julia> @btime d2x_2($ψ);
  88.578 μs (4 allocations: 3.35 MiB)

julia> @btime d2x_3($ψ);
  65.860 μs (2 allocations: 3.30 MiB)

Octavian’s Complex matmul isn’t really optimized, so it won’t scale as well to large sizes.
If anyone is interested in improving it, let me know and I can describe what’s needed, answer any questions, etc.
Also, for those concerned about Octavian’s time to first matmul: this will hopefully largely solve the problem.

EDIT: We can also improve Octavian’s small-size complex matmul performance for CPUs with the FMA3 instruction set (which is basically any recent x86 CPU).


In this situation Octavian performs pretty well on my machine (i7-4771 with Windows 10):

using Octavian, BenchmarkTools

const p = 60
const dd=randn(p,p)
const ψ=randn(p,p*p)+im*randn(p,p*p)

function d2x(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
    ψr = similar(ψ)
    matmul!(ψr,dd,ψ)
    return ψr
end

@btime d2x(ψ);

The result is

655.100 μs (3 allocations: 3.30 MiB)

compared to 1.330 ms using dd*ψ.


Setting p = 150, Octavian shows a 2.5× speedup compared to dd*ψ. Impressive!


I found that the efficiency of Octavian drops dramatically if ψ is multiplied by ddt from the right-hand side:

using Octavian, BenchmarkTools

const p = 60
const ddt=randn(p,p)
const ψ=randn(p*p,p)+im*randn(p*p,p)

function d2z(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
    ψr = similar(ψ)
    matmul!(ψr,ψ,ddt)
    return ψr
end

@btime d2z(ψ);

with the result

3.986 ms (2 allocations: 3.30 MiB)

much slower than ψ*ddt (1.230 ms). Is that normal?

This is related to:

The problem is fixable, but someone will have to get around to it.

In A*B, the size of matrix A is much more important than the size of B for determining scaling.
If anyone is interested in making the improvements I described in that comment, let me know and I can describe what needs to be done.
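
A quick way to probe this asymmetry yourself (a sketch; the exact numbers depend on your machine and thread count):

using Octavian, BenchmarkTools

A1 = randn(60, 60);               B1 = randn(ComplexF64, 60, 3600)   # small A, wide B
A2 = randn(ComplexF64, 3600, 60); B2 = randn(60, 60)                 # tall A, small B

@btime matmul!(C, $A1, $B1) setup=(C = Matrix{ComplexF64}(undef, 60, 3600));
@btime matmul!(C, $A2, $B2) setup=(C = Matrix{ComplexF64}(undef, 3600, 60));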


The whole point of Julia’s design is to enable type inference for properly written code. See the manual section on argument-type declarations.

d2x(ψ, dd) = dd * ψ

is every bit as fast as your version with all the type declarations, but it works for any argument types that support * (it is “type-generic”). (I also moved dd to be an argument rather than a const global, but that’s for generality and style rather than performance.) The compiler generates a specialized version of d2x for every combination of argument types detected at the call sites. This is the key advantage of Julia: you can write type-generic code that still compiles efficiently.
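
For instance, the same one-line definition handles other element types with no changes:

julia> d2x(ψ, dd);                                      # ComplexF64 times Float64, as before

julia> d2x(randn(Float32, 8, 8), randn(Float32, 8, 8)); # a fresh specialization for Float32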

Regarding your performance issue: the basic problem is that the optimized BLAS libraries don’t provide optimized implementations for real * complex matrix multiplies, so it was calling a slower fallback method. (That’s the advantage of a pure-Julia implementation like Octavian, but optimizing matrix multiplies is a lot of work: OpenBLAS has tens of thousands of lines of code just for matrix multiplies.)

On the other hand, if your matrices really are just 60×60, optimizing a kernel is in principle much easier, since at that size it is pretty much just about SIMD.
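
As a rough illustration of what such a kernel looks like in pure Julia (only a sketch for correctness; a competitive kernel would need blocking and explicit vectorization of the real and imaginary parts):

# Naive real*complex kernel; not tuned for performance.
function realcomplex_mul!(C::Matrix{ComplexF64}, A::Matrix{Float64}, B::Matrix{ComplexF64})
    @inbounds for j in axes(B, 2), i in axes(A, 1)
        acc = zero(ComplexF64)
        for k in axes(A, 2)
            acc += A[i, k] * B[k, j]
        end
        C[i, j] = acc
    end
    return C
end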


Is this a quick (dirty) fix?

using Octavian, BenchmarkTools

const p = 60
const dd=randn(p,p)
const ψ=randn(p*p,p)+im*randn(p*p,p)

function d2z(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
    ψr = similar(ψ, p, p*p)  # dd*transpose(ψ) has size (p, p*p), so the result comes out transposed
    matmul!(ψr,dd,transpose(ψ))
    return ψr
end

@btime d2z(ψ);

with the result

826.200 μs (3 allocations: 3.30 MiB)

https://github.com/JuliaLang/julia/pull/44074/files
https://github.com/JuliaLang/julia/pull/44011/files

These fix both * and mul!() on the master branch.

Unfortunately, https://github.com/JuliaLang/julia/pull/44074 doesn’t fix the in-place mul! (the one for matrix-vector multiplication does). It only improves Matrix{<:Real} * Matrix{<:Complex}.

Oh, not really. The result would still have to be transposed back to (p*p, p) (using permutedims to avoid a lazy transpose), which brings the total cost to more than 2 ms.
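
For reference, a corrected version of that quick fix would look like this (d2z_t is just an illustrative name; the extra permutedims is exactly where the time goes):

function d2z_t(ψ::Array{ComplexF64,2})::Array{ComplexF64,2}
    ψr = Matrix{ComplexF64}(undef, p, p*p)  # the product comes out transposed
    matmul!(ψr, dd, transpose(ψ))
    return permutedims(ψr)                  # ψ*transpose(dd), materialized as (p*p, p)
end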

Well, here is the real fix: reinterpret ψ as a real matrix with interleaved real and imaginary parts. Multiplying from the right by a real matrix only forms linear combinations of columns, so the interleaved column layout is preserved:

using Octavian, BenchmarkTools

const p = 60
const ddt=randn(p,p)
const ψ=randn(p*p,p)+im*randn(p*p,p)

function d2z(ψ::Array{ComplexF64,2})
    ψp = reinterpret(Float64, ψ)  # (2*p*p, p) real matrix, interleaved re/im
    ψr = similar(ψp)
    matmul!(ψr, ψp, ddt)          # purely real multiply
    return reinterpret(ComplexF64, ψr)
end

@btime d2z(ψ);

with the result

871.600 μs (2 allocations: 3.30 MiB)
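
The same reinterpret trick should also work with plain mul!, since a reinterpreted Array is still strided and should therefore hit the BLAS dgemm path (a sketch, not benchmarked here; d2z_blas is an illustrative name):

using LinearAlgebra

function d2z_blas(ψ::Array{ComplexF64,2})
    ψp = reinterpret(Float64, ψ)  # (2*p*p, p) real view, interleaved re/im
    ψr = similar(ψp)
    mul!(ψr, ψp, ddt)             # purely real product
    return reinterpret(ComplexF64, ψr)
end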