Indeed, I spent a bit of time chasing this too. I created this benchmark:
```julia
# v1.8.0
x = rand(ComplexF32, 1000); y = similar(x);
u = similar(x, UInt64); v = similar(u);
conjmask = reinterpret(UInt64, [conj(zero(ComplexF32))]) |> only
reinterpcopy!(y, x) = copy!(y, reinterpret(eltype(y), x))

using BenchmarkTools
@btime broadcast!(conj, $y, $x);
# 679.221 ns (0 allocations: 0 bytes)
@btime broadcast!(xor, $v, $u, $conjmask);
# 65.814 ns (0 allocations: 0 bytes)
@btime reinterpcopy!($y, broadcast!(xor, $v, reinterpcopy!($u, $x), $conjmask));
# 676.923 ns (0 allocations: 0 bytes)
```
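For what it's worth, the xor mask built above is just the sign bit of the packed imaginary half: `conj` only negates the imaginary part, so on a little-endian machine (an assumption here) the whole operation reduces to flipping bit 63 of each packed `UInt64`. A minimal check:

```julia
# conj(0 + 0im) == 0 - 0im, so only the sign bit of the (negative-zero)
# imaginary Float32 is set. Packed little-endian, the imaginary part
# occupies the high 32 bits of the UInt64, putting that sign bit at bit 63.
mask = only(reinterpret(UInt64, [conj(zero(ComplexF32))]))
@assert mask == 0x8000_0000_0000_0000
```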
The first two tests perform exactly the same bit transformations on the input arrays, yet the `UInt64` version manages to SIMD and unroll, achieving a ~10x speedup.
I tried to write a version using `reinterpret(UInt64, x::ComplexF32)` per-element, but ran into #42968.
Instead, the third test bit-copies the `Complex` array into the `UInt64` array, does the operation between the `UInt64` arrays, then copies the result back to a `Complex` array. Even with two extra unfused copies, it takes the same amount of time as the pure float version.
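As a sanity check (a minimal sketch, not from the original benchmark), the xor-with-mask route produces results identical to `conj`, since negating a float is exactly a sign-bit flip:

```julia
x = rand(ComplexF32, 16)
mask = only(reinterpret(UInt64, [conj(zero(ComplexF32))]))
u = reinterpret(UInt64, x)               # view of the same bits as UInt64
y = reinterpret(ComplexF32, u .⊻ mask)   # flip each imaginary sign bit
@assert y == conj.(x)
```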
I don’t have the expertise to say whether this is a Julia issue or an LLVM issue. The relevant definition is `conj(z::Complex) = Complex(real(z), -imag(z))`, but I’d be loath to write that any other way within Julia.
The `@code_llvm` output does perhaps suggest that Julia could generate better LLVM:
```
@code_llvm conj(2.0im)

;  @ complex.jl:276 within `conj`
define void @julia_conj_7225([2 x double]* noalias nocapture sret([2 x double]) %0, [2 x double]* nocapture nonnull readonly align 8 dereferenceable(16) %1) #0 {
top:
; ┌ @ complex.jl:72 within `real`
; │┌ @ Base.jl:38 within `getproperty`
   %2 = getelementptr inbounds [2 x double], [2 x double]* %1, i64 0, i64 0
; └└
; ┌ @ complex.jl:87 within `imag`
; │┌ @ Base.jl:38 within `getproperty`
   %3 = getelementptr inbounds [2 x double], [2 x double]* %1, i64 0, i64 1
; └└
; ┌ @ float.jl:381 within `-`
   %4 = load double, double* %3, align 8
   %5 = fneg double %4
; └
; ┌ @ complex.jl:14 within `Complex` @ complex.jl:14
   %6 = load double, double* %2, align 8
; └
  %.sroa.0.0..sroa_idx = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 0
  store double %6, double* %.sroa.0.0..sroa_idx, align 8
  %.sroa.2.0..sroa_idx1 = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 1
  store double %5, double* %.sroa.2.0..sroa_idx1, align 8
  ret void
}

@code_llvm -(2.0im)

;  @ complex.jl:287 within `-`
define void @julia_-_7227([2 x double]* noalias nocapture sret([2 x double]) %0, [2 x double]* nocapture nonnull readonly align 8 dereferenceable(16) %1) #0 {
top:
;  @ complex.jl:287 within `-` @ float.jl:381
  %2 = bitcast [2 x double]* %1 to <2 x double>*
  %3 = load <2 x double>, <2 x double>* %2, align 8
  %4 = fneg <2 x double> %3
;  @ complex.jl:287 within `-`
  %5 = bitcast [2 x double]* %0 to <2 x double>*
  store <2 x double> %4, <2 x double>* %5, align 8
  ret void
}
```
Notice the `getelementptr` business that is generated for `conj` but not for `-`. That may be getting in the way of proper optimization? The `@code_native` output seems to reflect this difference.
I don’t have significant knowledge or understanding of LLVM, so can’t say whether Julia’s generated LLVM is truly the issue. But whether it’s Julia or LLVM, there’s definitely a potential performance improvement that could come out of this. Surely, there are many other cases where we’re similarly suboptimal.
EDIT: The plot thickens. Duplicating the definition of `conj` seems to give a significant (but still partial) improvement. What’s going wrong with the normal one?
```julia
# v1.8.0
conj_duplicate(z::Complex) = Complex(real(z), -imag(z))
@btime broadcast!(conj_duplicate, $y, $x);
# 447.449 ns (0 allocations: 0 bytes)
```
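To be clear, `conj_duplicate` is textually identical to Base’s definition, so the results agree exactly and only the compiled code differs (a quick sanity check):

```julia
conj_duplicate(z::Complex) = Complex(real(z), -imag(z))

z = 1.5f0 + 2.5f0im
@assert conj_duplicate(z) === conj(z)   # same bits, same type
@assert conj(z) === 1.5f0 - 2.5f0im
```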