Hi,
I hope this hasn’t been discussed before, but has SIMD support for complex numbers been considered? Several C/C++ libraries take advantage of this. There’s an example implementation here, and vectorized complex numbers appear in projects like SPIRAL for FFTs, among others (can’t link due to being a new user). The performance gains can be significant, especially when combined with other SIMD capabilities, and most general-purpose CPUs support vectorization at this point.
Welcome! FWIW, please don’t link to copyrighted material that lacks an open license without warning people about it:
/*
* Copyright (C) 2006 Intel Corporation. All rights reserved.
*
* The information and source code contained herein is the exclusive
* property of Intel Corporation and may not be disclosed, examined,
* or reproduced in whole or in part without explicit written
* authorization from the Company.
*
* [Description]
* This code sample demonstrates the use of C in comparison with SSE2
* and SSE3 instrinsics to multiply two complex numbers.
*
* [Compile]
* icc double_complex.c (linux)
* icl double_complex.c (windows)
*
* [Output]
* Complex Product(C): 23.00+ -2.00i
* Complex Product(SSE3): 23.00+ -2.00i
* Complex Product(SSE2): 23.00+ -2.00i
*/
It’s not that it hasn’t been considered - far from it! It’s just that there’s usually no need to handle this explicitly in the core language (a lot of code using complex numbers already gets SIMD through LLVM just fine). As such, I’ve moved the topic from “Internals & Design” to “Usage”.
You can check out e.g. LoopVectorization.jl and the JuliaSIMD GitHub org for packages that aid in using SIMD in Julia code.
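For example, a plain elementwise loop over complex arrays is usually already vectorized by LLVM without any special annotations. A minimal sketch (the function name is just illustrative):
function cmul!(out, x, y)
    @inbounds @simd for i in eachindex(out, x, y)
        out[i] = x[i] * y[i]
    end
    return out
end

x = rand(ComplexF64, 1024); y = rand(ComplexF64, 1024); out = similar(x);
cmul!(out, x, y)
# inspect the vectorized IR with: @code_llvm debuginfo=:none cmul!(out, x, y)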
Sorry about the copyright! I’ll keep it in mind for the future, and I appreciate the recategorization.
Also, it seems my original post linked to the same thing twice, so I corrected that. The issue I’m raising, though, is this behavior:
julia> versioninfo()
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 11th Gen Intel(R) Core(TM) i5-11600K @ 3.90GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, icelake-client)
julia> x = 1.5 + 2.3im
1.5 + 2.3im
julia> y = 5.2 + 0.1im
5.2 + 0.1im
julia> @code_llvm x*y
; @ complex.jl:277 within `*'
; Function Attrs: uwtable
define void @"julia_*_513"([2 x double]* noalias nocapture sret %0, [2 x double]* nocapture nonnull readonly align 8 dereferenceable(16) %1, [2 x double]* nocapture nonnull readonly align 8 dereferenceable(16) %2) #0 {
top:
; ┌ @ complex.jl:63 within `real'
; │┌ @ Base.jl:33 within `getproperty'
%3 = getelementptr inbounds [2 x double], [2 x double]* %1, i64 0, i64 0
%4 = getelementptr inbounds [2 x double], [2 x double]* %2, i64 0, i64 0
; └└
; @ complex.jl:277 within `*' @ float.jl:332
%5 = load double, double* %3, align 8
%6 = load double, double* %4, align 8
%7 = fmul double %5, %6
; @ complex.jl:277 within `*'
; ┌ @ complex.jl:76 within `imag'
; │┌ @ Base.jl:33 within `getproperty'
%8 = getelementptr inbounds [2 x double], [2 x double]* %1, i64 0, i64 1
%9 = getelementptr inbounds [2 x double], [2 x double]* %2, i64 0, i64 1
; └└
; @ complex.jl:277 within `*' @ float.jl:332
%10 = load double, double* %8, align 8
%11 = load double, double* %9, align 8
%12 = fmul double %10, %11
; @ complex.jl:277 within `*'
; ┌ @ float.jl:329 within `-'
%13 = fsub double %7, %12
; └
; @ complex.jl:277 within `*' @ float.jl:332
%14 = fmul double %11, %5
%15 = fmul double %10, %6
; @ complex.jl:277 within `*'
; ┌ @ float.jl:326 within `+'
%16 = fadd double %14, %15
; └
%.sroa.0.0..sroa_idx = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 0
store double %13, double* %.sroa.0.0..sroa_idx, align 8
%.sroa.2.0..sroa_idx1 = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 1
store double %16, double* %.sroa.2.0..sroa_idx1, align 8
ret void
}
For reference, the scalar formula being computed is (a+bi)(c+di) = (ac−bd) + (ad+bc)i. There are four multiplications here, one subtraction, and one addition (as in any naive implementation), which is fine, but with vectorization you can accelerate this noticeably, even without putting things in arrays or loops.
How would you vectorize a single complex multiplication? I think once you consider the cost of repacking, it would be more expensive.
That’s what I’m curious about. I’d love to mock up a version, but I’m hampered by my lack of knowledge of LLVM. However, I do have a working version of complex multiplication in C/C++ using those intrinsics, from a library I’m working on (MIT License); there, pack<double, 2>::type is __m128d in AVX intrinsic notation. I genuinely don’t know if packing/unpacking would make it more expensive, but I suspect it’s cheaper because (at least in GCC) __m128d is really just an aligned double* with 2 elements, so I don’t think LLVM would have to move anything. If I’m not mistaken, the LLVM code above already aligns the memory anyway (though, again, I don’t read LLVM).
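In Julia terms, the layout point is easy to check: a Complex{Float64} is stored as two adjacent Float64s, so a 128-bit vector load shouldn’t need to move anything. A quick illustrative snippet:
z = 1.5 + 2.3im
reinterpret(Float64, [z]) == [1.5, 2.3]  # true: re then im, contiguous in memory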
Here are some benchmarks for the current implementation:
julia> using BenchmarkTools
julia> @benchmark x*y setup=(x=rand(ComplexF64);y=rand(ComplexF64))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 3.400 ns … 44.200 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.600 ns ┊ GC (median): 0.00%
Time (mean ± σ): 3.859 ns ± 1.824 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇▂▃ ▃ ▁
▇████▆██▆▄▄▄▅▆▄▄▄▄▅▅▅▃▄▄▄▄▁▃▃▃▄▁▃▁▃▄▃▃▄▄▃▃▄▅▃▃▃▃▄▃▄▄▄▃▁▄▁▃ █
3.4 ns Histogram: log(frequency) by time 9.6 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark x*y setup=(x=rand(ComplexF32);y=rand(ComplexF32))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 2.800 ns … 40.400 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.900 ns ┊ GC (median): 0.00%
Time (mean ± σ): 3.038 ns ± 1.631 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
██▄ ▂▃ ▁ ▂
███▁██▁█▆▅▁▆▅▁▅▅▅▁▆▆▁▅▅▁▅▁▅▁▄▄▁▄▃▁▁▄▄▁▄▄▁▅▄▄▁▃▄▁▄▄▄▁▄▅▁▁▄▃ █
2.8 ns Histogram: log(frequency) by time 6.9 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
So if you manage to get that faster (for a single multiplication!), I’d be very interested in how you got vector (un)packing faster than that.
For complex matmuls, Julia already calls BLAS, so the single-multiplication benchmark is not representative of that (or of other more complex code). If you multiply a bunch of aligned complex values, Julia may already SIMD them anyway, even if the single multiplication is not SIMDed optimally.
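To illustrate the matmul point, a minimal sketch (sizes arbitrary):
using LinearAlgebra
A = rand(ComplexF64, 256, 256); B = rand(ComplexF64, 256, 256);
C = similar(A);
mul!(C, A, B)  # dispatches to BLAS (zgemm), so the scalar * is not on this hot path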
You seem to be right, though I’m curious about the results if you don’t have to pack/unpack (i.e. if things were stored in the vectorized type to start with). Here’s what I used to compare:
using Pkg
Pkg.activate(temp=true)
Pkg.add(["SIMD", "BenchmarkTools"])
using SIMD, BenchmarkTools
import Base: *
@benchmark x*y setup=(x=rand(ComplexF64);y=rand(ComplexF64))
@benchmark x*y setup=(x=rand(ComplexF32);y=rand(ComplexF32))
function *(x::Complex{T}, y::Complex{T}) where T<:AbstractFloat
    v = Vec{2,T}((-one(T), one(T)))     # sign mask; one(T) avoids Float64 literals when T == Float32
    xx = Vec{2,T}((x.re, x.im))         # pack x as [a, b]
    yy = Vec{2,T}((y.re, y.im))         # pack y as [c, d]
    cc = shufflevector(yy, Val((0,0)))  # [c, c]
    ba = shufflevector(xx, Val((1,0)))  # [b, a]
    dd = shufflevector(yy, Val((1,1)))  # [d, d]
    dba = flipsign(ba*dd, v)            # [-b*d, a*d]
    res = muladd(xx, cc, dba)           # [a*c - b*d, b*c + a*d]
    Complex(res[1], res[2])
end
@benchmark x*y setup=(x=rand(ComplexF64);y=rand(ComplexF64))
@benchmark x*y setup=(x=rand(ComplexF32);y=rand(ComplexF32))
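As a quick sanity check, the redefined * can be compared against the textbook formula:
x, y = 1.5 + 2.3im, 5.2 + 0.1im
@assert x*y ≈ Complex(x.re*y.re - x.im*y.im, x.re*y.im + x.im*y.re)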
This gave results (on my computer) of
julia> @benchmark x*y setup=(x=rand(ComplexF32);y=rand(ComplexF32))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 0.800 ns … 27.700 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 0.900 ns ┊ GC (median): 0.00%
Time (mean ± σ): 0.888 ns ± 0.289 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▇ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▂
0.8 ns Histogram: frequency by time 1.1 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark x*y setup=(x=rand(ComplexF64);y=rand(ComplexF64))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 1.000 ns … 7.200 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.100 ns ┊ GC (median): 0.00%
Time (mean ± σ): 1.084 ns ± 0.126 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▂
1 ns Histogram: frequency by time 1.3 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
and
julia> @benchmark x*y setup=(x=rand(ComplexF32);y=rand(ComplexF32))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 1.800 ns … 31.000 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.900 ns ┊ GC (median): 0.00%
Time (mean ± σ): 1.878 ns ± 0.361 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▂
1.8 ns Histogram: frequency by time 2 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark x*y setup=(x=rand(ComplexF64);y=rand(ComplexF64))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 1.800 ns … 17.100 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.900 ns ┊ GC (median): 0.00%
Time (mean ± σ): 1.886 ns ± 0.208 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▇ █ ▃ ▁
█▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▄ █
1.8 ns Histogram: log(frequency) by time 2.2 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
But if I measure the time to actually pack, it’s pretty sizeable:
julia> f(x::Complex{T}, y::Complex{T}) where T<:Union{Float32,Float64} = (Vec{2,T}((x.re,x.im)), Vec{2,T}((y.re, y.im)))
f (generic function with 1 method)
julia>
julia> @benchmark f(x,y) setup=(x=rand(ComplexF32);y=rand(ComplexF32))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 1.200 ns … 28.800 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.200 ns ┊ GC (median): 0.00%
Time (mean ± σ): 1.253 ns ± 0.383 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▅
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▂
1.2 ns Histogram: frequency by time 1.3 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark f(x,y) setup=(x=rand(ComplexF64);y=rand(ComplexF64))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 0.800 ns … 2.700 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 0.900 ns ┊ GC (median): 0.00%
Time (mean ± σ): 0.941 ns ± 0.126 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄ █ ▃ ▁
█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▂ ▂
0.8 ns Histogram: frequency by time 1.3 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Some notes here, though: my CPU is nominally 3.90 GHz, which pins one clock cycle at around 0.25 ns. I’m not qualified to get into whether I’m running up against that or what the CPU actually executes in one clock cycle, but I think it’s safe to distrust these benchmarks’ exact accuracy. There’s some further exploration I’d love to take on whenever I get the time, but I thought I’d document this case study for others and see if anyone else has thoughts. I’d also love to save a few clock cycles and tap into the LLVM intrinsic for fmaddsub instead of having to create v just to use flipsign, but llvmcall gave me some trouble, so I had to give up.
I can add it to VectorizationBase.jl. You can file an issue there so I don’t forget.
I put up an issue in SIMD.jl since that seemed like the more pertinent package for this instruction. It turns out I had missed an issue in that repo about complex numbers that fizzled out, so I guess this was fate.
Reducing Packing Overhead in Matrix-Matrix Multiplication
@blackeneth I actually linked that third one in my original post. I’ve done this before in C/C++ with AVX, but I haven’t used LLVM before (otherwise I’d be using those intrinsics). The packing/unpacking is a bit simplistic in this example since I’m not doing any GEMM (so I don’t have to worry about caching/paging); it’s just loading the real and imaginary parts of the complex numbers into one Vec{N,T}. For more than one complex number, I would do something like vload(reinterpret(Float64, StructArray(v))), but that’s probably another topic for another day.
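A sketch of that SoA idea with StructArrays (illustrative, not benchmarked):
using StructArrays
v = StructArray(rand(ComplexF64, 8))
v.re  # contiguous Vector{Float64} of real parts
v.im  # contiguous Vector{Float64} of imaginary parts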
For what it’s worth, your modified * is a lot slower on my machine (code copy-pasted from above):
julia> @benchmark x*y setup=(x=rand(ComplexF64);y=rand(ComplexF64))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 3.400 ns … 27.300 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.600 ns ┊ GC (median): 0.00%
Time (mean ± σ): 3.905 ns ± 1.748 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇▃▄ ▁
████▆▅▆█▇▇▇▆▆▇▆▅▆▅▅▅▄▅▄▅▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▃▁▃▃▁▆ █
3.4 ns Histogram: log(frequency) by time 16.5 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark x*y setup=(x=rand(ComplexF32);y=rand(ComplexF32))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 2.800 ns … 37.500 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.900 ns ┊ GC (median): 0.00%
Time (mean ± σ): 2.937 ns ± 0.963 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ ▁ ▁ ▂
█▁▁█▁▁█▁▁█▁▁█▁▁▁█▁▁▅▁▁▄▁▁▆▁▁▁▅▁▁▆▁▁▅▁▁▇▁▁▆▁▁▁▇▁▁▆▁▁▅▁▁▄▁▁▃ █
2.8 ns Histogram: log(frequency) by time 4.6 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> function *(x::Complex{T}, y::Complex{T}) where T<:AbstractFloat
v = Vec{2,T}((-1.,1.))
xx = Vec{2, T}((x.re, x.im))
yy = Vec{2, T}((y.re, y.im))
cc = shufflevector(yy, Val((0,0)))
ba = shufflevector(xx, Val((1,0)))
dd = shufflevector(yy, Val((1,1)))
dba = flipsign(ba*dd,v)
res = muladd(xx, cc, dba)
Complex(res[1], res[2])
end
* (generic function with 371 methods)
julia> @benchmark x*y setup=(x=rand(ComplexF64);y=rand(ComplexF64))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 4.800 ns … 99.900 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.900 ns ┊ GC (median): 0.00%
Time (mean ± σ): 5.421 ns ± 2.735 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▆▄▃▁▁ ▁
██████████▇▆▆▇▆▇▆▅▆▆▄▅▆▅▄▄▃▄▄▃▁▄▁▃▃▁▃▁▁▁▃▃▃▁▃▃▄▁▄▃▃▁▁▁▄▆▆▇ █
4.8 ns Histogram: log(frequency) by time 18.5 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark x*y setup=(x=rand(ComplexF32);y=rand(ComplexF32))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 4.200 ns … 56.600 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.300 ns ┊ GC (median): 0.00%
Time (mean ± σ): 4.504 ns ± 1.648 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁█▅ ▁ ▂ ▁
███▇█▅██▁▆▅▄▄▃▄▃▁▃▃▄▄▅▄▄▁▅▄▄▃▄▄▃▄▁▄▅▄▄▁▄▄▁▄▄▃▄▄▄▃▁▄▄▃▃▄▃▄▄ █
4.2 ns Histogram: log(frequency) by time 9.3 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Probably caused by (un)packing being much slower:
julia> f(x::Complex{T}, y::Complex{T}) where T<:Union{Float32,Float64} = (Vec{2,T}((x.re,x.im)), Vec{2,T}((y.re, y.im)))
f (generic function with 1 method)
julia> @benchmark f(x,y) setup=(x=rand(ComplexF32);y=rand(ComplexF32))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 3.100 ns … 58.600 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.100 ns ┊ GC (median): 0.00%
Time (mean ± σ): 3.347 ns ± 1.833 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▆▂ ▃▂ ▁
████▁██▆▁▅▄▄▃▁▅▅▅▁▆▇▆▅▁▅▅▅▁▄▅▄▁▁▃▄▁▁▃▅▅▄▁▃▃▄▁▅▄▄▄▁▅▁▄▁▃▃▁▃ █
3.1 ns Histogram: log(frequency) by time 7.6 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark f(x,y) setup=(x=rand(ComplexF64);y=rand(ComplexF64))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 2.800 ns … 45.000 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.900 ns ┊ GC (median): 0.00%
Time (mean ± σ): 3.049 ns ± 1.177 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▆ █ ▆ ▂ ▁ ▂
█▁█▁█▁█▁█▁▁█▁█▁▇▁▆▁▁▅▁▅▁▆▁▇▁█▁▁█▁█▁█▁▇▁▁█▁▇▁▆▁▅▁▁▅▁▆▁▄▁▄▁▅ █
2.8 ns Histogram: log(frequency) by time 5.4 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Native code comparison
Your multiply:
julia> @code_native debuginfo=:none rand(ComplexF64)*rand(ComplexF64)
.text
movq %rdi, %rax
vmovupd (%rsi), %xmm0
vmovddup (%rdx), %xmm1 # xmm1 = mem[0,0]
vpermilpd $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vmovddup 8(%rdx), %xmm3 # xmm3 = mem[0,0]
vmulpd %xmm3, %xmm2, %xmm2
movabsq $.rodata.cst16, %rcx
vxorpd (%rcx), %xmm2, %xmm3
vblendpd $1, %xmm3, %xmm2, %xmm2 # xmm2 = xmm3[0],xmm2[1]
vfmadd231pd %xmm1, %xmm0, %xmm2 # xmm2 = (xmm0 * xmm1) + xmm2
vmovupd %xmm2, (%rdi)
retq
nopl (%rax,%rax)
julia>
Original multiply:
$ julia -q # new session
julia> @code_native debuginfo=:none rand(ComplexF64)*rand(ComplexF64)
.text
movq %rdi, %rax
vmovsd (%rsi), %xmm0 # xmm0 = mem[0],zero
vmovsd 8(%rsi), %xmm1 # xmm1 = mem[0],zero
vmovsd (%rdx), %xmm2 # xmm2 = mem[0],zero
vmovsd 8(%rdx), %xmm3 # xmm3 = mem[0],zero
vmulsd %xmm2, %xmm0, %xmm4
vmulsd %xmm3, %xmm1, %xmm5
vsubsd %xmm5, %xmm4, %xmm4
vmulsd %xmm0, %xmm3, %xmm0
vmulsd %xmm2, %xmm1, %xmm1
vaddsd %xmm1, %xmm0, %xmm0
vmovsd %xmm4, (%rdi)
vmovsd %xmm0, 8(%rdi)
retq
nopw (%rax,%rax)
julia> versioninfo()
Julia Version 1.7.0-beta3
Commit e76c9dad42 (2021-07-07 08:12 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.0 (ORCJIT, skylake)
Environment:
JULIA_PKG_SERVER =
JULIA_NUM_THREADS = 4
So yeah, I know it’s tempting to lose yourself in these kinds of nanobenchmarks, but keep in mind that a single complex multiply will not matter in the grand scheme of things. There are too many variables in play once that multiply is surrounded by other code (which may compile to completely different assembly, e.g. as Julia or LLVM notices that you’re doing the same kind of multiply over and over and SIMDs them more cleverly). Of course, that’s not to say that LLVM or Julia are perfect at this (LoopVectorization is evidence that there is often still performance left on the table), but it’s very situational.
I don’t think SIMD.jl is more pertinent than VectorizationBase.jl for this instruction, so I went ahead and added it in this commit.
It also uses feature detection to fall back on a version based on shuffles for CPUs without FMA.
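Roughly, such a fallback can be expressed as a sign flip plus an ordinary muladd; an illustrative sketch with SIMD.jl types (not the actual implementation):
using SIMD
function fmaddsub_fallback(a::Vec{2,Float64}, b::Vec{2,Float64}, c::Vec{2,Float64})
    signs = Vec{2,Float64}((-1.0, 1.0))  # subtract in lane 1, add in lane 2
    muladd(a, b, c * signs)
end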
Examples:
julia> vx2 = Vec(ntuple(_ -> rand(), Val(2))...)
Vec{2, Float64}<0.37237382789924234, 0.6382509798796134>
julia> vy2 = Vec(ntuple(_ -> rand(), Val(2))...)
Vec{2, Float64}<0.7272098000262965, 0.7430448048335492>
julia> vz2 = Vec(ntuple(_ -> rand(), Val(2))...)
Vec{2, Float64}<0.011526509562512555, 0.7452696191090189>
julia> VectorizationBase.vfmaddsub(vx2,vy2,vz2)
Vec{2, Float64}<0.259267387359122, 1.2195186938884879>
julia> @cn VectorizationBase.vfmaddsub(vx2,vy2,vz2)
.text
mov rax, rdi
vmovapd xmm0, xmmword ptr [rsi]
vmovapd xmm1, xmmword ptr [rdx]
vfmaddsub213pd xmm1, xmm0, xmmword ptr [rcx] # xmm1 = (xmm0 * xmm1) +/- mem
vmovapd xmmword ptr [rdi], xmm1
ret
nop word ptr cs:[rax + rax]
julia> vx = Vec(ntuple(_ -> rand(), pick_vector_width(Float64))...)
Vec{8, Float64}<0.0371378438550225, 0.6101315964217398, 0.18702462021703203, 0.06051847376330244, 0.31747377259357323, 0.09423184857609468, 0.3576888945419936, 0.2587005195532206>
julia> vy = Vec(ntuple(_ -> rand(), pick_vector_width(Float64))...)
Vec{8, Float64}<0.10039035844487065, 0.7621525414057622, 0.9114039069105686, 0.11942843942193893, 0.9797458325416961, 0.8660491193722564, 0.48745801987314885, 0.8677852992519912>
julia> vz = Vec(ntuple(_ -> rand(), pick_vector_width(Float64))...)
Vec{8, Float64}<0.021390827947836533, 0.7750678328506346, 0.655633872215031, 0.789299661365898, 0.1665200857675322, 0.09574529244474994, 0.15962217583127325, 0.13983981758371034>
julia> VectorizationBase.vfmaddsub(vx,vy,vz)
Vec{8, Float64}<-0.017662546491361186, 1.2400811796554183, -0.4851789026607627, 0.7965272882436468, 0.14452351987231132, 0.17735470192089656, 0.014736144432782509, 0.3643363253608475>
julia> @cn VectorizationBase.vfmaddsub(vx,vy,vz)
.text
mov rax, rdi
vmovupd zmm0, zmmword ptr [rsi]
vmovupd zmm1, zmmword ptr [rdx]
vfmaddsub213pd zmm1, zmm0, zmmword ptr [rcx] # zmm1 = (zmm0 * zmm1) +/- mem
vmovapd zmmword ptr [rdi], zmm1
vzeroupper
ret
nop
Interesting - sorry about the hassle. I only did that since SIMD.jl already had fmadd, so fmaddsub seemed like a natural extension. I appreciate the help!
I’ve now added vfmsubadd as well.
VectorizationBase.jl, like SIMD.jl, wraps a lot of LLVM intrinsics. It also provides an API around some hardware info (through HWloc.jl and by parsing LLVMGetHostCPUFeatures) so code can use hardware-specific features correctly. VectorizationBase’s primary purpose is to serve LoopVectorization.jl. It could use some more documentation and unit tests.
Thanks to @Elrod, I’m now basically tying the normal complex arithmetic on my system. I’d love to see how this performs on others’ systems, though.
julia> using VectorizationBase: Vec, shufflevector, vfmaddsub, mul_ieee
julia> function my_mul(x::Complex{T}, y::Complex{T}) where T<:AbstractFloat
           xx = Vec(x.re, x.im)                # [a, b]
           yy = Vec(y.re, y.im)                # [c, d]
           cc = shufflevector(yy, Val((0,0)))  # [c, c]
           ba = shufflevector(xx, Val((1,0)))  # [b, a]
           dd = shufflevector(yy, Val((1,1)))  # [d, d]
           dba = mul_ieee(ba, dd)              # [b*d, a*d]
           res = vfmaddsub(xx, cc, dba)        # [a*c - b*d, b*c + a*d]
           Complex{T}(res(1), res(2))          # VectorizationBase Vecs index with call syntax
       end
my_mul (generic function with 1 method)
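A quick check that it agrees with Base’s * before timing:
let x = rand(ComplexF64), y = rand(ComplexF64)
    @assert my_mul(x, y) ≈ x * y
end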
julia> @benchmark x*y setup=(x=rand(ComplexF64);y=rand(ComplexF64))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 1.200 ns … 15.300 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.300 ns ┊ GC (median): 0.00%
Time (mean ± σ): 1.265 ns ± 0.176 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▇ █
█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▂ ▂
1.2 ns Histogram: frequency by time 1.7 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark my_mul(x,y) setup=(x=rand(ComplexF64);y=rand(ComplexF64))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 1.200 ns … 16.300 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.200 ns ┊ GC (median): 0.00%
Time (mean ± σ): 1.252 ns ± 0.164 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▆
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
1.2 ns Histogram: frequency by time 1.4 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
This doesn’t actually work for ComplexF32, because there is no 2-pack for 32-bit floats. I’ll probably look more into a fully fleshed-out type after work. I also definitely agree with your point about losing the forest for the trees (I’ve certainly been known to do that), and I know that Julia’s stdlib doesn’t guarantee the fastest possible implementation of anything. I wanted to start a conversation on how one would do this, and I really appreciate the help and feedback from everyone in the thread. However, I still somewhat disagree that a single complex multiply doesn’t matter in the grand scheme of things: multiplication is the foundation and the bottleneck of any arithmetic algorithm involving complex numbers, and speeding it up speeds up everything that builds on top of it. That said, for day-to-day work, I certainly agree that the built-in complex multiplication is more than enough.
@danny.s VectorizationBase.jl is cool because it has by far the most user-friendly frontend in LoopVectorization.jl’s @turbo and @tturbo – it’s worth checking out if you haven’t yet. IIRC there is maybe even some possibility of @turbo supporting operations on isbits structs in the future?
I disagree. My point is precisely that the single multiplication may not be the most efficient building block when e.g. writing a matrix multiply (especially when that single multiplication is written in a complicated way with intrinsics that may be opaque to the compiler) or when multiplying multiple values in a loop. This in turn may leave the compiler unable to fuse multiple multiplications.
If you’re looking at a single multiplication, yes it’s faster, but the difference is on the order of noise when examined in context with other code surrounding that multiply.
I’m sure you’re aware of it, but it’s often better to avoid repeatedly un- and repacking values into vector registers, and instead to do as little packing as possible over as many values at a time as possible.
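For instance, four multiplies can share a single pack/unpack. An illustrative sketch along the lines of your SIMD.jl code (not tuned or benchmarked):
using SIMD

function mul4(x::NTuple{4,ComplexF64}, y::NTuple{4,ComplexF64})
    # pack all four values at once: [a1, b1, a2, b2, a3, b3, a4, b4]
    xx = Vec{8,Float64}((x[1].re, x[1].im, x[2].re, x[2].im,
                         x[3].re, x[3].im, x[4].re, x[4].im))
    yy = Vec{8,Float64}((y[1].re, y[1].im, y[2].re, y[2].im,
                         y[3].re, y[3].im, y[4].re, y[4].im))
    cc = shufflevector(yy, Val((0,0,2,2,4,4,6,6)))    # [c, c] per pair
    ba = shufflevector(xx, Val((1,0,3,2,5,4,7,6)))    # [b, a] per pair
    dd = shufflevector(yy, Val((1,1,3,3,5,5,7,7)))    # [d, d] per pair
    signs = Vec{8,Float64}((-1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0))
    res = muladd(xx, cc, flipsign(ba*dd, signs))      # [ac-bd, bc+ad] per pair
    ntuple(i -> Complex(res[2i-1], res[2i]), Val(4))  # unpack once at the end
end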
I won’t be in reach of my laptop for a few hours, but I’ll nevertheless be happy to benchmark your code on my machine once I’m back.
I think speeding up scalar complex multiplication is fine, but we should be aware that doing so isn’t the same thing as speeding up SIMD complex multiplication, which should be the goal for performance when we’re doing a large number of them.
With some work, I could probably make @turbo a bit more efficient in handling operations like these.