SIMD Complex Numbers

Hi,
I hope this hasn’t been discussed before, but has SIMD support for complex numbers been considered? Several C/C++ libraries take advantage of this. There’s an example implementation here, and vectorized complex numbers appear in projects like SPIRAL for FFTs and others (can’t link due to new-user restrictions). Performance gains can be pretty significant, especially if it can be combined with other SIMD abilities, and most general-purpose CPUs support vectorization at this point.

2 Likes

Welcome! FWIW, please don’t link to copyrighted material without an open license without warning people about it :slight_smile:

/*
 * Copyright (C) 2006 Intel Corporation.  All rights reserved.
 *
 * The information and source code contained herein is the exclusive
 * property  of Intel Corporation and may not be disclosed, examined,
 * or reproduced in  whole or in part without explicit written
 * authorization from the Company.
 *
 * [Description]
 * This code sample demonstrates the use of C in comparison with SSE2
 * and SSE3 intrinsics to multiply two complex numbers.
 *
 * [Compile]
 * icc double_complex.c (linux)
 * icl double_complex.c (windows)
 *
 * [Output]
 * Complex Product(C):    23.00+ -2.00i
 * Complex Product(SSE3): 23.00+ -2.00i
 * Complex Product(SSE2): 23.00+ -2.00i
 */
2 Likes

It’s not that it hasn’t been considered - far from it! It’s just that there’s usually no need to take care of this explicitly in the core language (and a lot of code using complex numbers already uses SIMD through LLVM just fine). As such, I’ve moved the topic from “Internals & Design” to “Usage”.

You can check out e.g. LoopVectorization.jl and the JuliaSIMD GitHub org for packages that aid in using SIMD in julia code.

1 Like

Sorry about the copyright! I’ll keep it in mind for the future, and I appreciate the recategorization.

Also, it seems like my original post linked to the same thing twice, so I corrected that. The behavior I’m suggesting could be improved, though, is this:

julia> versioninfo()
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 11th Gen Intel(R) Core(TM) i5-11600K @ 3.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, icelake-client)

julia> x = 1.5 + 2.3im
1.5 + 2.3im

julia> y = 5.2 + 0.1im
5.2 + 0.1im

julia> @code_llvm x*y
;  @ complex.jl:277 within `*'
; Function Attrs: uwtable
define void @"julia_*_513"([2 x double]* noalias nocapture sret %0, [2 x double]* nocapture nonnull readonly align 8 dereferenceable(16) %1, [2 x double]* nocapture nonnull readonly align 8 dereferenceable(16) %2) #0 {
top:
; ┌ @ complex.jl:63 within `real'
; │┌ @ Base.jl:33 within `getproperty'
    %3 = getelementptr inbounds [2 x double], [2 x double]* %1, i64 0, i64 0
    %4 = getelementptr inbounds [2 x double], [2 x double]* %2, i64 0, i64 0
; └└
;  @ complex.jl:277 within `*' @ float.jl:332
  %5 = load double, double* %3, align 8
  %6 = load double, double* %4, align 8
  %7 = fmul double %5, %6
;  @ complex.jl:277 within `*'
; ┌ @ complex.jl:76 within `imag'
; │┌ @ Base.jl:33 within `getproperty'
    %8 = getelementptr inbounds [2 x double], [2 x double]* %1, i64 0, i64 1
    %9 = getelementptr inbounds [2 x double], [2 x double]* %2, i64 0, i64 1
; └└
;  @ complex.jl:277 within `*' @ float.jl:332
  %10 = load double, double* %8, align 8
  %11 = load double, double* %9, align 8
  %12 = fmul double %10, %11
;  @ complex.jl:277 within `*'
; ┌ @ float.jl:329 within `-'
   %13 = fsub double %7, %12
; └
;  @ complex.jl:277 within `*' @ float.jl:332
  %14 = fmul double %11, %5
  %15 = fmul double %10, %6
;  @ complex.jl:277 within `*'
; ┌ @ float.jl:326 within `+'
   %16 = fadd double %14, %15
; └
  %.sroa.0.0..sroa_idx = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 0
  store double %13, double* %.sroa.0.0..sroa_idx, align 8
  %.sroa.2.0..sroa_idx1 = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 1
  store double %16, double* %.sroa.2.0..sroa_idx1, align 8
  ret void
}

There are four multiplications here plus a subtraction and an addition (as you would see in any naive implementation), which is fine, but with vectorization you can accelerate this reasonably, even without putting things in arrays or loops.
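For reference, the lane pattern the SSE3 sample referenced earlier uses (duplicate the real part, swap lanes, multiply, then subtract in one lane and add in the other) can be sketched in plain Julia, with tuples standing in for 2-wide registers (no actual SIMD here, just the data movement):

```julia
# Scalar sketch of the shuffle/addsub lane pattern for (a+bi)*(c+di).
# Each tuple stands in for a 2-lane vector register.
function mul_addsub_pattern(x::Complex{Float64}, y::Complex{Float64})
    xx = (real(x), imag(x))          # (a, b)
    cc = (real(y), real(y))          # duplicate real lane: (c, c)
    dd = (imag(y), imag(y))          # duplicate imag lane: (d, d)
    ba = (xx[2], xx[1])              # swap lanes: (b, a)
    t  = ba .* dd                    # (b*d, a*d)
    p  = xx .* cc                    # (a*c, b*c)
    res = (p[1] - t[1], p[2] + t[2]) # "addsub": subtract in lane 1, add in lane 2
    Complex(res[1], res[2])          # (a*c - b*d) + (b*c + a*d)i
end
```

In real SSE3 the last step is a single `addsubpd`, so the whole product is two multiplies, one shuffle-heavy setup, and one add/sub across both lanes at once.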

How would you vectorize a single complex multiplication? I think once you consider the cost of repacking, it would be more expensive.

1 Like

That’s what I’m curious about. I’d love to mock up a version, but I’m hampered by my lack of knowledge of LLVM. However, I do have a working version of complex multiplication in C/C++ using those intrinsics. Here’s a link to a library I’ve worked on (MIT License), where pack<double, 2>::type is __m128d in AVX intrinsic notation. I genuinely don’t know whether packing/unpacking would make it more expensive, but I suspect it’s cheaper: at least in GCC, __m128d is really just an aligned double* with 2 elements, so I don’t think LLVM would have to move anything. If I’m not mistaken, the LLVM code above already aligns the memory anyway (though, again, I can’t read LLVM).
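The layout point can at least be checked from the Julia side: a ComplexF64 is two adjacent Float64s, matching the [2 x double] in the IR above, so an array of them reinterprets to flat doubles without any copying. A quick sketch:

```julia
# A ComplexF64 occupies 16 contiguous bytes: re at offset 0, im at offset 8,
# matching the [2 x double] layout the LLVM IR operates on.
z = [1.5 + 2.3im, 5.2 + 0.1im]   # Vector{ComplexF64}
flat = reinterpret(Float64, z)   # zero-copy view over the same memory
# flat reads back as [1.5, 2.3, 5.2, 0.1]
```

So loading a complex number into a 2-wide register is a plain 16-byte load; the open question is whether the shuffles afterwards pay for themselves.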

Here are some benchmarks for the current implementation:

julia> using BenchmarkTools

julia> @benchmark x*y setup=(x=rand(ComplexF64);y=rand(ComplexF64))       
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.                 
 Range (min … max):  3.400 ns … 44.200 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.600 ns              ┊ GC (median):    0.00%        
 Time  (mean ± σ):   3.859 ns ±  1.824 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
                                                                          
   █▇▂▃ ▃                                                    ▁            
  ▇████▆██▆▄▄▄▅▆▄▄▄▄▅▅▅▃▄▄▄▄▁▃▃▃▄▁▃▁▃▄▃▃▄▄▃▃▄▅▃▃▃▃▄▃▄▄▄▃▁▄▁▃ █            
  3.4 ns       Histogram: log(frequency) by time      9.6 ns <            
                                                                          
 Memory estimate: 0 bytes, allocs estimate: 0.                            
                                                                          
julia> @benchmark x*y setup=(x=rand(ComplexF32);y=rand(ComplexF32))       
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.                 
 Range (min … max):  2.800 ns … 40.400 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.900 ns              ┊ GC (median):    0.00%        
 Time  (mean ± σ):   3.038 ns ±  1.631 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
                                                                          
  ██▄ ▂▃ ▁                                                   ▂            
  ███▁██▁█▆▅▁▆▅▁▅▅▅▁▆▆▁▅▅▁▅▁▅▁▄▄▁▄▃▁▁▄▄▁▄▄▁▅▄▄▁▃▄▁▄▄▄▁▄▅▁▁▄▃ █            
  2.8 ns       Histogram: log(frequency) by time      6.9 ns <            
                                                                          
 Memory estimate: 0 bytes, allocs estimate: 0.                            

So if you manage to get that faster (for a single multiplication!) I’d be very interested in how you got vector (un)packing faster than that.

For complex matmuls, julia already calls BLAS, so the single-mul benchmark is not representative of that (or of other more complex code). If you multiply a bunch of aligned complex values, julia may SIMD already anyway, even if the single multiplication is not SIMDed optimally.
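The loop case can be sketched like this (a minimal illustration, not a benchmark; whether LLVM actually SIMDs the body depends on the target):

```julia
# Elementwise complex multiply over contiguous arrays: the loop hands LLVM
# many independent multiplies, which it can vectorize across elements.
function mulvec!(z, x, y)
    @inbounds @simd for i in eachindex(z, x, y)
        z[i] = x[i] * y[i]
    end
    return z
end

x = [1.0 + 2.0im, 3.0 + 4.0im]
y = [5.0 + 6.0im, 7.0 + 8.0im]
z = similar(x)
mulvec!(z, x, y)   # [-7.0 + 16.0im, -11.0 + 52.0im]
```

Inspecting this with `@code_llvm mulvec!(z, x, y)` is a fairer test of what the compiler does with complex arithmetic than a single scalar multiply.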

2 Likes

You seem to be right, though I’m curious about the results if you don’t have to pack/unpack (i.e. if things were stored in the vectorized type to start with). Here’s what I used to compare:

using Pkg
Pkg.activate(temp=true)
Pkg.add(["SIMD", "BenchmarkTools"])
using SIMD, BenchmarkTools
import Base: *

@benchmark x*y setup=(x=rand(ComplexF64);y=rand(ComplexF64))

@benchmark x*y setup=(x=rand(ComplexF32);y=rand(ComplexF32))

function *(x::Complex{T}, y::Complex{T}) where T<:AbstractFloat
    v = Vec{2,T}((-1.,1.))
    xx = Vec{2, T}((x.re, x.im))
    yy = Vec{2, T}((y.re, y.im))
    cc = shufflevector(yy, Val((0,0)))
    ba = shufflevector(xx, Val((1,0)))
    dd = shufflevector(yy, Val((1,1)))
    dba = flipsign(ba*dd,v)
    res = muladd(xx, cc, dba)
    Complex(res[1], res[2])
end

@benchmark x*y setup=(x=rand(ComplexF64);y=rand(ComplexF64))

@benchmark x*y setup=(x=rand(ComplexF32);y=rand(ComplexF32))

This gave results (on my computer) of

julia> @benchmark x*y setup=(x=rand(ComplexF32);y=rand(ComplexF32))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  0.800 ns … 27.700 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     0.900 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   0.888 ns ±  0.289 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇                  █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▂
  0.8 ns         Histogram: frequency by time         1.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark x*y setup=(x=rand(ComplexF64);y=rand(ComplexF64))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.000 ns … 7.200 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.100 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.084 ns ± 0.126 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                 █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▂
  1 ns           Histogram: frequency by time        1.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

and

julia> @benchmark x*y setup=(x=rand(ComplexF32);y=rand(ComplexF32))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.800 ns … 31.000 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.900 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.878 ns ±  0.361 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃                           █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▂
  1.8 ns         Histogram: frequency by time           2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark x*y setup=(x=rand(ComplexF64);y=rand(ComplexF64))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.800 ns … 17.100 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.900 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.886 ns ±  0.208 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇             █             ▃                              ▁
  █▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▄ █
  1.8 ns       Histogram: log(frequency) by time      2.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

But if I measure the time to actually pack, it’s pretty sizeable

julia> f(x::Complex{T}, y::Complex{T}) where T<:Union{Float32,Float64} = (Vec{2,T}((x.re,x.im)), Vec{2,T}((y.re, y.im)))
f (generic function with 1 method)

julia>

julia> @benchmark f(x,y) setup=(x=rand(ComplexF32);y=rand(ComplexF32))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.200 ns … 28.800 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.200 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.253 ns ±  0.383 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                        ▅
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▂
  1.2 ns         Histogram: frequency by time         1.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f(x,y) setup=(x=rand(ComplexF64);y=rand(ComplexF64))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  0.800 ns … 2.700 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     0.900 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   0.941 ns ± 0.126 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄          █          ▃           ▁
  █▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▂ ▂
  0.8 ns         Histogram: frequency by time        1.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Some notes here though-- my CPU is theoretically 3.90GHz, pinning one clock cycle at around 0.25ns. I’m not qualified to get into whether I’m going over that or what the CPU is actually executing in one clock cycle, but I think it’s safe to distrust these benchmarks’ exact accuracy. There’s some further exploration I’d love to take on whenever I get the time, but I thought I’d document this case study for others and see if anyone else might have thoughts. I’d also love to save a few clock cycles and tap into the LLVM intrinsic for fmaddsub instead of having to create v just to use flipsign, but llvmcall gave me some trouble so I had to give up.

I can add it to VectorizationBase.jl. You can file an issue there so I don’t forget.

2 Likes

Reducing Packing Overhead in Matrix-Matrix Multiplication

Crunching Numbers with AVX and AVX2

Mixed Data Layout Kernels for Vectorized Complex Arithmetic

I put up an issue in SIMD.jl since that seemed like the more pertinent package for this instruction. It turns out I’d missed an existing issue in that repo about complex numbers that fizzled out, so I guess this was fate.

@blackeneth I actually linked that third one in my original post :smiley:. I’ve done this before in C/C++ with AVX, but I haven’t used LLVM before (otherwise I’d be using those intrinsics). The packing/unpacking is a bit simplistic in this example since I’m not doing any GEMM (so I don’t have to worry about caching/paging), it’s just loading up the real and imaginary parts of the complex numbers into one Vec{N,T}. What I would do if it were more than one complex number would definitely be something like vload(reinterpret(Float64, StructArray(v))), but that’s probably another topic for another day.

For what it’s worth, your modified * is a lot slower on my machine (code copy-pasted from above):

julia> @benchmark x*y setup=(x=rand(ComplexF64);y=rand(ComplexF64))       
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.                
 Range (min … max):  3.400 ns … 27.300 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.600 ns              ┊ GC (median):    0.00%        
 Time  (mean ± σ):   3.905 ns ±  1.748 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
                                                                          
  █▇▃▄                                                       ▁            
  ████▆▅▆█▇▇▇▆▆▇▆▅▆▅▅▅▄▅▄▅▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▃▁▃▃▁▆ █            
  3.4 ns       Histogram: log(frequency) by time     16.5 ns <            
                                                                          
 Memory estimate: 0 bytes, allocs estimate: 0.                            
                                                                          
julia> @benchmark x*y setup=(x=rand(ComplexF32);y=rand(ComplexF32))       
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.                
 Range (min … max):  2.800 ns … 37.500 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.900 ns              ┊ GC (median):    0.00%        
 Time  (mean ± σ):   2.937 ns ±  0.963 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
                                                                          
  █  █  ▁     ▁                                              ▂            
  █▁▁█▁▁█▁▁█▁▁█▁▁▁█▁▁▅▁▁▄▁▁▆▁▁▁▅▁▁▆▁▁▅▁▁▇▁▁▆▁▁▁▇▁▁▆▁▁▅▁▁▄▁▁▃ █            
  2.8 ns       Histogram: log(frequency) by time      4.6 ns <            
                                                                          
 Memory estimate: 0 bytes, allocs estimate: 0.                            
                                                                          
julia> function *(x::Complex{T}, y::Complex{T}) where T<:AbstractFloat    
           v = Vec{2,T}((-1.,1.))                                         
           xx = Vec{2, T}((x.re, x.im))                                   
           yy = Vec{2, T}((y.re, y.im))                                   
           cc = shufflevector(yy, Val((0,0)))                             
           ba = shufflevector(xx, Val((1,0)))                             
           dd = shufflevector(yy, Val((1,1)))                             
           dba = flipsign(ba*dd,v)                                        
           res = muladd(xx, cc, dba)                                      
           Complex(res[1], res[2])                                        
       end                                                                
* (generic function with 371 methods)                                     
                                                                          
julia> @benchmark x*y setup=(x=rand(ComplexF64);y=rand(ComplexF64))       
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.                
 Range (min … max):  4.800 ns … 99.900 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.900 ns              ┊ GC (median):    0.00%        
 Time  (mean ± σ):   5.421 ns ±  2.735 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
                                                                          
  █▆▄▃▁▁                                                     ▁            
  ██████████▇▆▆▇▆▇▆▅▆▆▄▅▆▅▄▄▃▄▄▃▁▄▁▃▃▁▃▁▁▁▃▃▃▁▃▃▄▁▄▃▃▁▁▁▄▆▆▇ █            
  4.8 ns       Histogram: log(frequency) by time     18.5 ns <            
                                                                          
 Memory estimate: 0 bytes, allocs estimate: 0.                            
                                                                          
julia> @benchmark x*y setup=(x=rand(ComplexF32);y=rand(ComplexF32))       
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.                
 Range (min … max):  4.200 ns … 56.600 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.300 ns              ┊ GC (median):    0.00%        
 Time  (mean ± σ):   4.504 ns ±  1.648 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
                                                                          
  ▁█▅ ▁  ▂                                                   ▁            
  ███▇█▅██▁▆▅▄▄▃▄▃▁▃▃▄▄▅▄▄▁▅▄▄▃▄▄▃▄▁▄▅▄▄▁▄▄▁▄▄▃▄▄▄▃▁▄▄▃▃▄▃▄▄ █            
  4.2 ns       Histogram: log(frequency) by time      9.3 ns <            
                                                                          
 Memory estimate: 0 bytes, allocs estimate: 0.                            

Probably caused by (un)packing being much slower:

julia> f(x::Complex{T}, y::Complex{T}) where T<:Union{Float32,Float64} = (Vec{2,T}((x.re,x.im)), Vec{2,T}((y.re, y.im)))
f (generic function with 1 method)                                                                                      
                                                                                                                        
julia> @benchmark f(x,y) setup=(x=rand(ComplexF32);y=rand(ComplexF32))                                                  
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.                                                              
 Range (min … max):  3.100 ns … 58.600 ns  ┊ GC (min … max): 0.00% … 0.00%                                              
 Time  (median):     3.100 ns              ┊ GC (median):    0.00%                                                      
 Time  (mean ± σ):   3.347 ns ±  1.833 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%                                              
                                                                                                                        
  █▆▂  ▃▂                                                    ▁                                                          
  ████▁██▆▁▅▄▄▃▁▅▅▅▁▆▇▆▅▁▅▅▅▁▄▅▄▁▁▃▄▁▁▃▅▅▄▁▃▃▄▁▅▄▄▄▁▅▁▄▁▃▃▁▃ █                                                          
  3.1 ns       Histogram: log(frequency) by time      7.6 ns <                                                          
                                                                                                                        
 Memory estimate: 0 bytes, allocs estimate: 0.                                                                          
                                                                                                                        
julia> @benchmark f(x,y) setup=(x=rand(ComplexF64);y=rand(ComplexF64))                                                  
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.                                                              
 Range (min … max):  2.800 ns … 45.000 ns  ┊ GC (min … max): 0.00% … 0.00%                                              
 Time  (median):     2.900 ns              ┊ GC (median):    0.00%                                                      
 Time  (mean ± σ):   3.049 ns ±  1.177 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%                                              
                                                                                                                        
  █ ▆ █ ▆ ▂    ▁                                             ▂                                                          
  █▁█▁█▁█▁█▁▁█▁█▁▇▁▆▁▁▅▁▅▁▆▁▇▁█▁▁█▁█▁█▁▇▁▁█▁▇▁▆▁▅▁▁▅▁▆▁▄▁▄▁▅ █                                                          
  2.8 ns       Histogram: log(frequency) by time      5.4 ns <                                                          
                                                                                                                        
 Memory estimate: 0 bytes, allocs estimate: 0.                                                                          
Native code comparison

Your multiply:

julia> @code_native debuginfo=:none rand(ComplexF64)*rand(ComplexF64)           
        .text                                                                   
        movq    %rdi, %rax                                                      
        vmovupd (%rsi), %xmm0                                                   
        vmovddup        (%rdx), %xmm1                   # xmm1 = mem[0,0]       
        vpermilpd       $1, %xmm0, %xmm2        # xmm2 = xmm0[1,0]              
        vmovddup        8(%rdx), %xmm3                  # xmm3 = mem[0,0]       
        vmulpd  %xmm3, %xmm2, %xmm2                                             
        movabsq $.rodata.cst16, %rcx                                            
        vxorpd  (%rcx), %xmm2, %xmm3                                            
        vblendpd        $1, %xmm3, %xmm2, %xmm2         # xmm2 = xmm3[0],xmm2[1]
        vfmadd231pd     %xmm1, %xmm0, %xmm2     # xmm2 = (xmm0 * xmm1) + xmm2   
        vmovupd %xmm2, (%rdi)                                                   
        retq                                                                    
        nopl    (%rax,%rax)                                                     
                                                                                
julia>                                                                          

Original multiply:

$ julia -q  # new session                                                    
julia> @code_native debuginfo=:none rand(ComplexF64)*rand(ComplexF64)           
        .text                                                                   
        movq    %rdi, %rax                                                      
        vmovsd  (%rsi), %xmm0                   # xmm0 = mem[0],zero            
        vmovsd  8(%rsi), %xmm1                  # xmm1 = mem[0],zero            
        vmovsd  (%rdx), %xmm2                   # xmm2 = mem[0],zero            
        vmovsd  8(%rdx), %xmm3                  # xmm3 = mem[0],zero            
        vmulsd  %xmm2, %xmm0, %xmm4                                             
        vmulsd  %xmm3, %xmm1, %xmm5                                             
        vsubsd  %xmm5, %xmm4, %xmm4                                             
        vmulsd  %xmm0, %xmm3, %xmm0                                             
        vmulsd  %xmm2, %xmm1, %xmm1                                             
        vaddsd  %xmm1, %xmm0, %xmm0                                             
        vmovsd  %xmm4, (%rdi)                                                   
        vmovsd  %xmm0, 8(%rdi)                                                  
        retq                                                                    
        nopw    (%rax,%rax)                                                     
julia> versioninfo()                           
Julia Version 1.7.0-beta3                      
Commit e76c9dad42 (2021-07-07 08:12 UTC)       
Platform Info:                                 
  OS: Linux (x86_64-linux-gnu)                 
  CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64                                
  LIBM: libopenlibm                            
  LLVM: libLLVM-12.0.0 (ORCJIT, skylake)       
Environment:                                   
  JULIA_PKG_SERVER =                           
  JULIA_NUM_THREADS = 4                        

So yeah, I know it’s tempting to lose yourself in these kinds of nanobenchmarks, but you’ve got to keep in mind that running a single complex multiply will not matter in the grand scheme of things. There are too many variables in play when that multiply is surrounded by other pieces of code (which may compile to completely different assembly as e.g. julia or LLVM notices that you’re doing the same kinds of multiplies over and over and SIMDs them smarter). Of course that’s not to say that LLVM or julia are perfect at this (LoopVectorization is the example that there often is still performance left on the table) but it’s very situational.

I don’t think SIMD.jl is more pertinent than VectorizationBase.jl for this instruction, so I went ahead and added it in this commit.
It also uses feature detection to fall back on a version based on shuffles for CPUs without FMA.
Examples:

julia> vx2 = Vec(ntuple(_ -> rand(), Val(2))...)
Vec{2, Float64}<0.37237382789924234, 0.6382509798796134>

julia> vy2 = Vec(ntuple(_ -> rand(), Val(2))...)
Vec{2, Float64}<0.7272098000262965, 0.7430448048335492>

julia> vz2 = Vec(ntuple(_ -> rand(), Val(2))...)
Vec{2, Float64}<0.011526509562512555, 0.7452696191090189>

julia> VectorizationBase.vfmaddsub(vx2,vy2,vz2)
Vec{2, Float64}<0.259267387359122, 1.2195186938884879>

julia> @cn VectorizationBase.vfmaddsub(vx2,vy2,vz2)
        .text
        mov     rax, rdi
        vmovapd xmm0, xmmword ptr [rsi]
        vmovapd xmm1, xmmword ptr [rdx]
        vfmaddsub213pd  xmm1, xmm0, xmmword ptr [rcx] # xmm1 = (xmm0 * xmm1) +/- mem
        vmovapd xmmword ptr [rdi], xmm1
        ret
        nop     word ptr cs:[rax + rax]

julia> vx = Vec(ntuple(_ -> rand(), pick_vector_width(Float64))...)
Vec{8, Float64}<0.0371378438550225, 0.6101315964217398, 0.18702462021703203, 0.06051847376330244, 0.31747377259357323, 0.09423184857609468, 0.3576888945419936, 0.2587005195532206>

julia> vy = Vec(ntuple(_ -> rand(), pick_vector_width(Float64))...)
Vec{8, Float64}<0.10039035844487065, 0.7621525414057622, 0.9114039069105686, 0.11942843942193893, 0.9797458325416961, 0.8660491193722564, 0.48745801987314885, 0.8677852992519912>

julia> vz = Vec(ntuple(_ -> rand(), pick_vector_width(Float64))...)
Vec{8, Float64}<0.021390827947836533, 0.7750678328506346, 0.655633872215031, 0.789299661365898, 0.1665200857675322, 0.09574529244474994, 0.15962217583127325, 0.13983981758371034>

julia> VectorizationBase.vfmaddsub(vx,vy,vz)
Vec{8, Float64}<-0.017662546491361186, 1.2400811796554183, -0.4851789026607627, 0.7965272882436468, 0.14452351987231132, 0.17735470192089656, 0.014736144432782509, 0.3643363253608475>

julia> @cn VectorizationBase.vfmaddsub(vx,vy,vz)
        .text
        mov     rax, rdi
        vmovupd zmm0, zmmword ptr [rsi]
        vmovupd zmm1, zmmword ptr [rdx]
        vfmaddsub213pd  zmm1, zmm0, zmmword ptr [rcx] # zmm1 = (zmm0 * zmm1) +/- mem
        vmovapd zmmword ptr [rdi], zmm1
        vzeroupper
        ret
        nop
4 Likes

Interesting, and sorry about the hassle! I only did that since SIMD.jl already had fmadd, so fmaddsub seemed like a natural extension. I appreciate the help!

I’ve now added vfmsubadd as well.

VectorizationBase.jl, like SIMD.jl, wraps a lot of LLVM intrinsics.
It also provides an API around some hardware info (through HWloc.jl and parsing LLVMGetHostCPUFeatures) so code can use hardware-specific features correctly.
VectorizationBase’s primary purpose is to serve LoopVectorization.jl.
It could use some more documentation and unit tests.

3 Likes

Thanks to @Elrod I’m now basically tying the normal complex arithmetic on my system. I’d love to see how this performs on others’ systems, though.

julia> using VectorizationBase: Vec, shufflevector, vfmaddsub, mul_ieee

julia> function my_mul(x::Complex{T}, y::Complex{T}) where T<:AbstractFloat
           xx = Vec(x.re,x.im)
           yy = Vec(y.re,y.im)
           cc = shufflevector(yy, Val((0,0)))
           ba = shufflevector(xx, Val((1,0)))
           dd = shufflevector(yy, Val((1,1)))
           dba = mul_ieee(ba,dd)
           res = vfmaddsub(xx,cc,dba)
           Complex{T}(res(1), res(2))
       end
my_mul (generic function with 1 method)

julia> @benchmark x*y setup=(x=rand(ComplexF64);y=rand(ComplexF64))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.200 ns … 15.300 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.300 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.265 ns ±  0.176 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇          █
  █▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▂ ▂
  1.2 ns         Histogram: frequency by time         1.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark my_mul(x,y) setup=(x=rand(ComplexF64);y=rand(ComplexF64))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.200 ns … 16.300 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.200 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.252 ns ±  0.164 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                            ▆
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  1.2 ns         Histogram: frequency by time         1.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

This doesn’t actually work for ComplexF32 because there is no 2-pack for 32-bit floats; I’ll probably look into a fully fleshed-out type after work. I also definitely agree with your point about how easy it is to lose the forest for the trees (I’ve certainly been known to do that), and I know that Julia’s stdlib doesn’t guarantee the fastest possible implementation of anything. I mainly wanted to start a conversation on how I would do this, and I really appreciate the help and feedback from everyone in the thread. That said, I still somewhat disagree that a single complex multiply doesn’t matter in the grand scheme of things: multiplication is the foundation and bottleneck of any arithmetic algorithm on complex numbers, and speeding it up speeds up everything built on top of it. For day-to-day work, though, I certainly agree that the built-in complex multiplication is more than enough.

@danny.s VectorizationBase.jl is cool because it has by far the most user-friendly frontend in LoopVectorization.jl’s @turbo and @tturbo. It’s worth checking out if you haven’t yet.

IIRC there is maybe even some possibility of @turbo supporting operations on isbits structs in the future?

I disagree. My point is precisely that the single multiplication may not be the most efficient building block when e.g. writing a matrix multiply (especially when the single multiplication is written in a complex manner with intrinsics that may be opaque to the compiler) or multiplying many values in a loop. This in turn may leave the compiler unable to fuse multiple multiplications.

If you’re looking at a single multiplication, yes it’s faster, but the difference is on the order of noise when examined in context with other code surrounding that multiply.

I’m sure you’re aware of it, but often it’s better to avoid repeated un- and repacking of values to vector registers and instead do as little packing with as many values at the same time as possible.
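As a sketch of that idea, a struct-of-arrays layout keeps the reals and imaginaries in separate arrays, so whole registers of each load with no shuffling at all (hypothetical function name, not any particular package’s API):

```julia
# Struct-of-arrays sketch: separate re/im arrays, so each iteration is
# plain lane-parallel arithmetic with no per-element shuffles.
function mul_soa!(zre, zim, xre, xim, yre, yim)
    @inbounds @simd for i in eachindex(zre, zim, xre, xim, yre, yim)
        a, b, c, d = xre[i], xim[i], yre[i], yim[i]
        zre[i] = a * c - b * d   # real part
        zim[i] = a * d + b * c   # imaginary part
    end
    return nothing
end

xre, xim = [1.0, 3.0], [2.0, 4.0]
yre, yim = [5.0, 7.0], [6.0, 8.0]
zre, zim = similar(xre), similar(xim)
mul_soa!(zre, zim, xre, xim, yre, yim)   # zre = [-7.0, -11.0], zim = [16.0, 52.0]
```

This is the layout trick packages like StructArrays.jl make convenient for interleaved data.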

I won’t be in reach of my laptop for a few hours, but I’ll nevertheless be happy to benchmark your code on my machine once I’m back.

3 Likes

I think speeding up scalar complex multiplication is fine, but we should be aware that doing so isn’t the same thing as speeding up SIMD complex multiplication, which should be the goal for performance when we’re doing a large number of them.

With some work, I could probably make @turbo a bit more efficient in handling operations like these.

6 Likes