Indeed, I spent a bit of time chasing this too. I created this benchmark:
# v1.8.0
x = rand(ComplexF32, 1000); y = similar(x);
u = similar(x, UInt64); v = similar(u);  # sizeof(UInt64) == sizeof(ComplexF32)
conjmask = reinterpret(UInt64, [conj(zero(ComplexF32))]) |> only  # xor mask that flips the sign bit of the imaginary part
reinterpcopy!(y, x) = copy!(y, reinterpret(eltype(y), x))  # bit-copy between arrays of same-size element types
using BenchmarkTools
@btime broadcast!(conj,$y,$x);
#  679.221 ns (0 allocations: 0 bytes)
@btime broadcast!(xor,$v,$u,$conjmask);
#  65.814 ns (0 allocations: 0 bytes)
@btime reinterpcopy!($y, broadcast!(xor,$v,reinterpcopy!($u,$x),$conjmask));
#  676.923 ns (0 allocations: 0 bytes)
The first two tests apply exactly the same bit transformation to the input arrays, yet the UInt version manages to SIMD and unroll, achieving a 10x speedup.
I tried to write a version using reinterpret(UInt64, x::ComplexF32) per-element, but ran into #42968.
Instead, the third test bit-copies the Complex array into the UInt array, performs the xor between the UInt arrays, then copies the result back into the Complex array. Even with the two extra unfused copies, it takes the same amount of time as the pure Complex version.
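As a sanity check, the xor route really is bit-identical to conj (a quick test using the setup above):
reinterpcopy!(y, broadcast!(xor, v, reinterpcopy!(u, x), conjmask));
@assert reinterpret(UInt64, y) == reinterpret(UInt64, conj.(x))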
I don’t have the expertise to say whether this is a Julia issue or an LLVM issue. The relevant definition is conj(z::Complex) = Complex(real(z),-imag(z)), but I’d be loath to write that any other way within Julia.
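For what it's worth, per-component reinterpret does work at the scalar level. A minimal sketch of a bit-twiddling conj for ComplexF32 that flips the sign bit of the imaginary part directly (conj_bits is a made-up name, not anything in Base; I haven't checked whether it actually vectorizes better):
# Hypothetical helper: flip the imaginary sign bit via UInt32 bit ops.
conj_bits(z::ComplexF32) =
    ComplexF32(real(z), reinterpret(Float32, reinterpret(UInt32, imag(z)) ⊻ 0x80000000))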
The @code_llvm output does seem to suggest that Julia could generate better LLVM IR:
@code_llvm conj(2.0im)
;  @ complex.jl:276 within `conj`
define void @julia_conj_7225([2 x double]* noalias nocapture sret([2 x double]) %0, [2 x double]* nocapture nonnull readonly align 8 dereferenceable(16) %1) #0 {
top:
; ┌ @ complex.jl:72 within `real`
; │┌ @ Base.jl:38 within `getproperty`
    %2 = getelementptr inbounds [2 x double], [2 x double]* %1, i64 0, i64 0
; └└
; ┌ @ complex.jl:87 within `imag`
; │┌ @ Base.jl:38 within `getproperty`
    %3 = getelementptr inbounds [2 x double], [2 x double]* %1, i64 0, i64 1
; └└
; ┌ @ float.jl:381 within `-`
   %4 = load double, double* %3, align 8
   %5 = fneg double %4
; └
; ┌ @ complex.jl:14 within `Complex` @ complex.jl:14
   %6 = load double, double* %2, align 8
; └
  %.sroa.0.0..sroa_idx = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 0
  store double %6, double* %.sroa.0.0..sroa_idx, align 8
  %.sroa.2.0..sroa_idx1 = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 1
  store double %5, double* %.sroa.2.0..sroa_idx1, align 8
  ret void
}
 
@code_llvm -(2.0im)
;  @ complex.jl:287 within `-`
define void @julia_-_7227([2 x double]* noalias nocapture sret([2 x double]) %0, [2 x double]* nocapture nonnull readonly align 8 dereferenceable(16) %1) #0 {
top:
;  @ complex.jl:287 within `-` @ float.jl:381
  %2 = bitcast [2 x double]* %1 to <2 x double>*
  %3 = load <2 x double>, <2 x double>* %2, align 8
  %4 = fneg <2 x double> %3
;  @ complex.jl:287 within `-`
  %5 = bitcast [2 x double]* %0 to <2 x double>*
  store <2 x double> %4, <2 x double>* %5, align 8
  ret void
}
 
Notice the getelementptr business that is generated for conj but not for -. That may be getting in the way of proper optimization? The @code_native output reflects the same difference.
I don’t have deep knowledge of LLVM, so I can’t say whether Julia’s generated IR is truly the issue. But whether it’s Julia or LLVM, there’s clearly a potential performance improvement to be had here, and presumably many other cases where we’re similarly suboptimal.
EDIT: The plot thickens. Duplicating the definition of conj gives a significant (but still partial) improvement. What’s going wrong with the normal one?
# v1.8.0
conj_duplicate(z::Complex) = Complex(real(z),-imag(z))
@btime broadcast!(conj_duplicate,$y,$x);
#  447.449 ns (0 allocations: 0 bytes)
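For anyone curious, the conj_bits sketch from above could be benchmarked the same way (I haven't recorded timings here; they will vary by machine):
@btime broadcast!(conj_bits, $y, $x);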