Compiler optimizations for ComplexF64 vs Fortran

My former colleague Michael Rutter has posted his comparison of Fortran compilers regarding their optimization of complex-number arithmetic. I was wondering if anyone would like to see how Julia compares on these?

For example, the trivial implementation of the conjugation snippet
g(v) = map!(conj, v, v)
gives on my laptop (likely significantly underpowered compared to MR’s test machine) about 0.5 ns per element, which is somewhat behind the pack but within the right ballpark. (I’m setting aside that using e.g. StructArrays.jl gains an order of magnitude in performance.)
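For reference, a hedged sketch of the StructArrays variant (the variable names are mine; timings will of course vary by machine):

using StructArrays, BenchmarkTools
v = StructArray(rand(ComplexF64, 1000))
# the imaginary parts live in their own contiguous array,
# so the sign flip vectorizes well
@btime map!(conj, $v, $v);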

However, the multiplication implementation
h(x,y,z) = map!(*, z, x, y)
gives me about 1 ns per element, which is more in line with the other compilers. To my surprise, an explicit for loop (sketched below) is not equivalent and gives an even better 0.7 ns per element. (I store the results in a third array here since the repeated in-place multiplication caused by @btime leads to NaNs, which slow the execution down considerably.)
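The explicit-loop variant I mean is roughly this (a minimal sketch; the function name is mine):

# explicit loop; same effect as map!(*, z, x, y)
function mul_loop!(z, x, y)
    @inbounds for i in eachindex(z, x, y)
        z[i] = x[i] * y[i]
    end
    return z
end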

Would it be interesting to see how Julia compares with the other benchmarks and/or compiled code?

I think what Fortran compilers do is generally closer to Julia with @fastmath enabled.

In particular, there might be a difference with division.
From gcc’s documentation:

-fcx-fortran-rules

Complex multiplication and division follow Fortran rules. Range reduction is done as part of complex division, but there is no checking whether the result of a complex multiplication or division is NaN + I*NaN, with an attempt to rescue the situation in that case.

The default is -fno-cx-fortran-rules.

I’m just assuming that Fortran follows Fortran rules, though (while other languages compiled with GCC default to not doing so).

I assume we have these checks in Julia (except under @fastmath)?
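One rough way to probe this at the REPL (my own spot checks, not an authoritative account of what Base guarantees):

# Multiplication: the textbook formula gives imag = Inf*0.0 + 0.0*1.0 = NaN,
# so a result of Inf + NaN*im would mean no rescue is attempted for `*`
complex(Inf, 0.0) * complex(1.0, 0.0)

# Division: |1e300 + 1e300im|^2 overflows, so the naive formula returns zero,
# while a range-reducing algorithm recovers ≈ 5.0e-301 - 5.0e-301im
complex(1.0, 0.0) / complex(1e300, 1e300)
@fastmath complex(1.0, 0.0) / complex(1e300, 1e300)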

Here the scale_r benchmark runs, with gfortran, in ~2.5 s. That is very much dependent on -march=native (without it, it goes to ~6 s), suggesting that loop vectorization is playing an important role here.

The corresponding Julia code:

julia> a = 0.999999999;

julia> c = fill(ComplexF64(0.99999999,2.0),1000);

julia> function bench_scale_r(c,a; rep=10^7+1)
           for i in 1:rep
               for j in eachindex(c)
                   @inbounds c[j] = c[j]*a
               end
           end
           return c[7]
       end
bench_scale_r (generic function with 1 method)

julia> @time bench_scale_r(c,a)
  3.928719 seconds (4.79 k allocations: 330.579 KiB, 0.31% compilation time)
0.9900498231336597 + 1.9800996660683166im

julia> @time bench_scale_r(c,a)
  3.919136 seconds (1 allocation: 32 bytes)
0.9801986620889829 + 1.9603973437819382im

julia> using StructArrays

julia> c2 = StructArray(c);

julia> @time bench_scale_r(c2,a)
  1.013725 seconds (6.46 k allocations: 486.307 KiB, 1.21% compilation time)
0.9704455217415278 + 1.9408910628919662im

julia> @time bench_scale_r(c2,a)
  0.998821 seconds (1 allocation: 32 bytes)
0.9607894267689606 + 1.9215788727537109im

Executes in ~4 s, and in ~1 s with StructArrays. I guess the Fortran compiler can make assumptions about the structure of the arrays that the Julia function cannot, given the static character of the arrays in Fortran, and that the loop vectorizes better with StructArrays’ struct-of-arrays layout?

(I tested @fastmath; it has no effect on the execution time.)

One can check the Fortran performance by compiling:

program st
  implicit none
  complex(kind(1d0)), allocatable::c(:)
  real (kind(1d0))::a
  integer :: i, n, rep1, j
  real :: t1, t2
  n=1000
  allocate(c(n))

  a=0.999999999d0
  c=(0.99999999d0,2d0)
  rep1=1e7+1
  
  call cpu_time(t1)
  do i=1,rep1
      do j=1,n
         c(j)=a*c(j)
      enddo
  enddo
  call cpu_time(t2)

  write(*,*)c(7)
  write(*,*)'Time=',t2-t1
end

with gfortran -O3 -march=native teste.f90 -o test

This is a slightly simpler example than the ones provided on that page.

The code generated by gfortran and by Julia is very different. The gfortran output is on Compiler Explorer, and the Julia side is:
julia> @code_native bench_scale_r(c,a)
	.text
	.file	"bench_scale_r"
	.globl	julia_bench_scale_r_634         # -- Begin function julia_bench_scale_r_634
	.p2align	4, 0x90
	.type	julia_bench_scale_r_634,@function
julia_bench_scale_r_634:                # @julia_bench_scale_r_634
; ┌ @ REPL[13]:1 within `bench_scale_r`
	.cfi_startproc
# %bb.0:                                # %top
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	pushq	%rbx
	subq	$24, %rsp
	.cfi_offset %rbx, -24
	movq	%rsi, %rdx
	movq	%rdi, %rbx
	movabsq	$"j_#bench_scale_r#14_636", %rax
	leaq	-24(%rbp), %rdi
	movl	$10000001, %esi                 # imm = 0x989681
	callq	*%rax
	vmovups	-24(%rbp), %xmm0
	vmovups	%xmm0, (%rbx)
	movq	%rbx, %rax
	addq	$24, %rsp
	popq	%rbx
	popq	%rbp
	.cfi_def_cfa %rsp, 8
	retq
.Lfunc_end0:
	.size	julia_bench_scale_r_634, .Lfunc_end0-julia_bench_scale_r_634
	.cfi_endproc
; └
                                        # -- End function
	.section	".note.GNU-stack","",@progbits


This is in case someone understands the assembly and wants to compare them.

If a performance difference arises in these sorts of nanobenchmarks, it’s usually useful to look at the generated assembly.
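In Julia that’s a one-liner at the REPL; e.g., for the conjugation snippet from the top of the thread (debuginfo=:none merely trims the source-line annotations):

g(v) = map!(conj, v, v)
v = rand(ComplexF64, 1000)
@code_native debuginfo=:none g(v)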

The original post mentions conjugation and multiplication in particular. Neither of these usually has special cases that one could handle or ignore at different degrees of optimization.

The only thing I can imagine making a performance difference in conjugation is loop overhead (most likely the unrolling factor). As for multiplication, I would expect the most likely driver of meaningful performance differences to be the use or lack of fma instructions*, with a modest contribution from the degree of loop unrolling. There’s some chance that vectorization is missed, but that would show up as slowdowns approaching 2x or more.

*Note: I do not advocate for the use of fma in complex multiplication because it is almost impossible to use (with performance benefit) while retaining properties like imag(x * conj(x)) == 0.
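To make the footnote concrete, here is a self-contained sketch (cmul_fma is my own strawman, not anything in Base):

# With the standard formula, imag(x * conj(x)) cancels exactly: the two cross
# products are the same rounded double with opposite signs. An fma keeps one
# product unrounded, which generally breaks that cancellation.
cmul_fma(a, b) = complex(fma(real(a), real(b), -(imag(a) * imag(b))),
                         fma(real(a), imag(b), imag(a) * real(b)))

x = 1.1 + 2.3im
imag(x * conj(x))           # 0.0 exactly
imag(cmul_fma(x, conj(x)))  # typically a tiny nonzero residual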

I had inspected the assembly code produced by the Julia compiler relative to that resulting from (i) Fortran via Godbolt, and (ii) Julia when using a StructArray. In (ii), the conjugation of each complex number is done by xor’ing elements in the imaginary array, which is efficiently vectorizable. In the Fortran code (i), the equivalent operation seems to be implemented by xor’ing each element in the complex array with the right 128-bit mask to flip the sign of only the imaginary part. However, the implementation over a regular complex array in Julia seems to include a lot of unnecessary data shuffles in between, whose purpose is not really clear to me. Frustratingly, even when I attempted to implement the same approach as (i) by reinterpreting the array as an array of Int128 and xor’ing appropriately (sketched below), the Julia compiler produced the same suboptimal reshuffling code. I haven’t tested the approach of reinterpreting to Int64 or Float64 and xor’ing/negating every other element. Does anyone have an idea how to track down why those reshuffles are occurring? Might they come from the LLVM stage rather than the Julia compilation?
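A sketch of that attempt (reconstructed, not the exact code; I use UInt128 for the mask arithmetic, and assume the usual little-endian layout where the imaginary part occupies the high 64 bits of each element):

# conjugate in place by flipping only the sign bit of each imaginary part
function conj_xor!(v::Vector{ComplexF64})
    u = reinterpret(UInt128, v)
    mask = UInt128(1) << 127   # sign bit of the high (imaginary) Float64
    @inbounds for i in eachindex(u)
        u[i] ⊻= mask
    end
    return v
end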

Indeed, I spent a bit of time chasing this too. I created this benchmark:

# v1.8.0
x = rand(ComplexF32,1000); y = similar(x);
u = similar(x, UInt64); v = similar(u);
conjmask = reinterpret(UInt64,[conj(zero(ComplexF32))]) |> only
reinterpcopy!(y,x) = copy!(y,reinterpret(eltype(y),x))

using BenchmarkTools
@btime broadcast!(conj,$y,$x);
#  679.221 ns (0 allocations: 0 bytes)
@btime broadcast!(xor,$v,$u,$conjmask);
#  65.814 ns (0 allocations: 0 bytes)
@btime reinterpcopy!($y, broadcast!(xor,$v,reinterpcopy!($u,$x),$conjmask));
#  676.923 ns (0 allocations: 0 bytes)

The first two tests do the exact same bit transformations to the input arrays, although the UInt version manages to SIMD and unroll and achieves a 10x speedup.

I tried to write a version using reinterpret(UInt64, x::ComplexF32) per element, but ran into #42968.

Instead, the third one bit-copies the Complex array into the UInt array, does the operation between the UInt arrays, then copies the result back to a Complex array. It takes the same amount of time as the pure float version even with two extra unfused copies.


I don’t have the expertise to say whether this is a Julia issue or an LLVM issue. The relevant definition is conj(z::Complex) = Complex(real(z),-imag(z)), but I’d be loath to write that any other way within Julia.

The @code_llvm output does perhaps suggest that Julia could generate better LLVM?

@code_llvm conj(2.0im)
;  @ complex.jl:276 within `conj`
define void @julia_conj_7225([2 x double]* noalias nocapture sret([2 x double]) %0, [2 x double]* nocapture
 nonnull readonly align 8 dereferenceable(16) %1) #0 {
top:
; ┌ @ complex.jl:72 within `real`
; │┌ @ Base.jl:38 within `getproperty`
    %2 = getelementptr inbounds [2 x double], [2 x double]* %1, i64 0, i64 0
; └└
; ┌ @ complex.jl:87 within `imag`
; │┌ @ Base.jl:38 within `getproperty`
    %3 = getelementptr inbounds [2 x double], [2 x double]* %1, i64 0, i64 1
; └└
; ┌ @ float.jl:381 within `-`
   %4 = load double, double* %3, align 8
   %5 = fneg double %4
; └
; ┌ @ complex.jl:14 within `Complex` @ complex.jl:14
   %6 = load double, double* %2, align 8
; └
  %.sroa.0.0..sroa_idx = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 0
  store double %6, double* %.sroa.0.0..sroa_idx, align 8
  %.sroa.2.0..sroa_idx1 = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 1
  store double %5, double* %.sroa.2.0..sroa_idx1, align 8
  ret void
}
@code_llvm -(2.0im)
;  @ complex.jl:287 within `-`
define void @julia_-_7227([2 x double]* noalias nocapture sret([2 x double]) %0, [2 x double]* nocapture nonnull readonly align 8 dereferenceable(16) %1) #0 {
top:
;  @ complex.jl:287 within `-` @ float.jl:381
  %2 = bitcast [2 x double]* %1 to <2 x double>*
  %3 = load <2 x double>, <2 x double>* %2, align 8
  %4 = fneg <2 x double> %3
;  @ complex.jl:287 within `-`
  %5 = bitcast [2 x double]* %0 to <2 x double>*
  store <2 x double> %4, <2 x double>* %5, align 8
  ret void
}

Notice the getelementptr business that is generated for conj but not for -. That may be getting in the way of proper optimization? The @code_native output seems to reflect this difference.

I don’t have significant knowledge or understanding of LLVM, so can’t say whether Julia’s generated LLVM is truly the issue. But whether it’s Julia or LLVM, there’s definitely a potential performance improvement that could come out of this. Surely, there are many other cases where we’re similarly suboptimal.


EDIT: The plot thickens. Duplicating the definition of conj seems to give a significant (but still partial) improvement. What’s going wrong with the normal one?

# v1.8.0
conj_duplicate(z::Complex) = Complex(real(z),-imag(z))
@btime broadcast!(conj_duplicate,$y,$x);
#  447.449 ns (0 allocations: 0 bytes)

@mikmoore Interesting! I suppose the fact that the third implementation, with all its copying, takes almost exactly the same time as the vanilla one hints that the slowness of the vanilla implementation comes from roughly the same amount of unnecessary data shuffling?