Possibly mildly faster sign for integers?

Suggested in a Stack Overflow answer on micro-optimization: Fast sign of integer in C - Stack Overflow

julia> _sign(x) = (x > 0) - (x < 0)
_sign (generic function with 1 method)

Using this,

julia> v = 1:1_000; w = similar(v);

julia> @btime map!(sign, $w, $v);
  142.305 ns (0 allocations: 0 bytes)

julia> @btime map!(_sign, $w, $v);
  125.669 ns (0 allocations: 0 bytes)

I wonder why this version is faster than the Base implementation, which uses a similar logic? The Base implementation uses two ifelse checks rather than subtracting the results.
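For reference, the Base definition for Real (paraphrased from base/number.jl; the exact form may vary between Julia versions) is roughly two nested ifelse calls:

```julia
# Sketch of the Base implementation (approximate, adapted from base/number.jl):
# two nested ifelse calls, which on integers LLVM lowers to an smin/smax clamp.
sign_base(x::Real) = ifelse(x < zero(x), oftype(one(x), -1),
                            ifelse(x > zero(x), one(x), zero(x)))
```

On integer arguments LLVM recognizes this as clamping x to [-1, 1], which is why smin/smax intrinsics show up in the LLVM IR.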


What’s the difference in native code? I’m surprised LLVM isn’t able to prove that they’re the same.


On 1.11.2, godbolt.org shows a single test instruction for the micro-optimized version, but two test instructions and two conditional moves for the existing sign function. I did not check what effect this has on the vectorized versions that map! would end up using.


Well spotted! It appears that people were under the misconception that LLVM would figure that out (I am surprised by that failure as well).

So we would need an extra method for all signed bit-integer types (Int8, Int16, Int32, Int64, Int128). Note:

julia> @code_llvm sign(1)
; Function Signature: sign(Int64)
;  @ number.jl:162 within `sign`
define i64 @julia_sign_5068(i64 signext %"x::Int64") #0 {
top:
; β”Œ @ essentials.jl:797 within `ifelse`
   %0 = call i64 @llvm.smin.i64(i64 %"x::Int64", i64 1)
   %1 = call i64 @llvm.smax.i64(i64 %0, i64 -1)
   ret i64 %1
; β””
}

This does not make it obvious that the resulting machine code sucks on Intel.

The other code is

julia> @code_llvm _sign(1)
; Function Signature: _sign(Int64)
;  @ REPL[3]:1 within `_sign`
define i64 @julia__sign_5070(i64 signext %"x::Int64") #0 {
top:
; β”Œ @ bool.jl:167 within `-`
; β”‚β”Œ @ boot.jl:954 within `Int64`
; β”‚β”‚β”Œ @ boot.jl:881 within `toInt64`
     %"x::Int64.lobit.neg" = ashr i64 %"x::Int64", 63
; β”‚β””β””
; β”‚ @ bool.jl:167 within `-` @ int.jl:86
   %isnotnull = icmp ne i64 %"x::Int64", 0
   %isnotnull.zext = zext i1 %isnotnull to i64
   %0 = or i64 %"x::Int64.lobit.neg", %isnotnull.zext
   ret i64 %0
; β””
}

julia> @code_native _sign(1)
	.text
	.file	"_sign"
	.section	.ltext,"axl",@progbits
	.globl	julia__sign_5072                # -- Begin function julia__sign_5072
	.p2align	4, 0x90
	.type	julia__sign_5072,@function
julia__sign_5072:                       # @julia__sign_5072
; Function Signature: _sign(Int64)
; β”Œ @ REPL[3]:1 within `_sign`
# %bb.0:                                # %top
	#DEBUG_VALUE: _sign:x <- $rdi
	push	rbp
	mov	rbp, rsp
; β”‚β”Œ @ bool.jl:167 within `-`
; β”‚β”‚β”Œ @ boot.jl:954 within `Int64`
; β”‚β”‚β”‚β”Œ @ boot.jl:881 within `toInt64`
	mov	rcx, rdi
	sar	rcx, 63
; β”‚β”‚β””β””
; β”‚β”‚ @ bool.jl:167 within `-` @ int.jl:86
	xor	eax, eax
	test	rdi, rdi
	setne	al
	or	rax, rcx
	pop	rbp
	ret
julia> @code_native sign(1)
	.text
	.file	"sign"
	.section	.ltext,"axl",@progbits
	.globl	julia_sign_5074                 # -- Begin function julia_sign_5074
	.p2align	4, 0x90
	.type	julia_sign_5074,@function
julia_sign_5074:                        # @julia_sign_5074
; Function Signature: sign(Int64)
; β”Œ @ number.jl:162 within `sign`
# %bb.0:                                # %top
	#DEBUG_VALUE: sign:x <- $rdi
	push	rbp
	mov	rbp, rsp
; β”‚β”Œ @ essentials.jl:797 within `ifelse`
	test	rdi, rdi
	mov	ecx, 1
	cmovle	rcx, rdi
	test	rcx, rcx
	mov	rax, -1
	cmovns	rax, rcx
	pop	rbp
	ret
.Lfunc_end0:
	.size	julia_sign_5074, .Lfunc_end0-julia_sign_5074
; β””β””
                                        # -- End function
	.section	".note.GNU-stack","",@progbits

PS. I guess the critical part here is how this looks inside of a tight simd loop (which you are benchmarking with map!). The hot loop part for your fast version is

.LBB0_18:                               # %vector.body
                                        # =>This Inner Loop Header: Depth=1
; β”‚β”‚β”‚β”‚ @ simdloop.jl:77 within `macro expansion` @ broadcast.jl:995
; β”‚β”‚β”‚β”‚β”Œ @ broadcast.jl:616 within `getindex`
; β”‚β”‚β”‚β”‚β”‚β”Œ @ broadcast.jl:620 within `_getindex`
; β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ broadcast.jl:671 within `_broadcast_getindex`
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ broadcast.jl:696 within `_getindex`
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ broadcast.jl:665 within `_broadcast_getindex`
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ essentials.jl:918 within `getindex`
	vmovdqu	ymm2, ymmword ptr [rax + 8*rdx]
	vmovdqu	ymm3, ymmword ptr [rax + 8*rdx + 32]
	vmovdqu	ymm4, ymmword ptr [rax + 8*rdx + 64]
	vmovdqu	ymm5, ymmword ptr [rax + 8*rdx + 96]
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β””β””β””
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚ @ broadcast.jl:672 within `_broadcast_getindex`
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ broadcast.jl:699 within `_broadcast_getindex_evalf`
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ REPL[3]:1 within `_sign`
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ bool.jl:167 within `-`
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ boot.jl:954 within `Int64`
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ boot.jl:881 within `toInt64`
	vpcmpgtq	ymm6, ymm0, ymm2
	vpcmpgtq	ymm7, ymm0, ymm3
	vpcmpgtq	ymm8, ymm0, ymm4
	vpcmpgtq	ymm9, ymm0, ymm5
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β””β””
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚ @ bool.jl:167 within `-` @ int.jl:86
	vpcmpeqq	ymm2, ymm2, ymm0
	vpandn	ymm2, ymm2, ymm1
	vpor	ymm2, ymm6, ymm2
	vpcmpeqq	ymm3, ymm3, ymm0
	vpandn	ymm3, ymm3, ymm1
	vpor	ymm3, ymm7, ymm3
	vpcmpeqq	ymm4, ymm4, ymm0
	vpandn	ymm4, ymm4, ymm1
	vpor	ymm4, ymm8, ymm4
	vpcmpeqq	ymm5, ymm5, ymm0
	vpandn	ymm5, ymm5, ymm1
	vpor	ymm5, ymm9, ymm5
; β”‚β”‚β”‚β”‚β””β””β””β””β””β””
; β”‚β”‚β”‚β”‚β”Œ @ array.jl:986 within `setindex!`
; β”‚β”‚β”‚β”‚β”‚β”Œ @ array.jl:991 within `_setindex!`
	vmovdqu	ymmword ptr [rcx + 8*rdx], ymm2
	vmovdqu	ymmword ptr [rcx + 8*rdx + 32], ymm3
	vmovdqu	ymmword ptr [rcx + 8*rdx + 64], ymm4
	vmovdqu	ymmword ptr [rcx + 8*rdx + 96], ymm5
; β”‚β”‚β”‚β”‚β””β””
; β”‚β”‚β”‚β”‚ @ simdloop.jl:78 within `macro expansion`
; β”‚β”‚β”‚β”‚β”Œ @ int.jl:87 within `+`
	add	rdx, 16
	cmp	rsi, rdx
	jne	.LBB0_18

while the slower base version has

.LBB0_18:                               # %vector.body
                                        # =>This Inner Loop Header: Depth=1
; β”‚β”‚β”‚β”‚ @ simdloop.jl:77 within `macro expansion` @ broadcast.jl:995
; β”‚β”‚β”‚β”‚β”Œ @ broadcast.jl:616 within `getindex`
; β”‚β”‚β”‚β”‚β”‚β”Œ @ broadcast.jl:620 within `_getindex`
; β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ broadcast.jl:671 within `_broadcast_getindex`
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ broadcast.jl:696 within `_getindex`
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ broadcast.jl:665 within `_broadcast_getindex`
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ essentials.jl:918 within `getindex`
	vmovdqu	ymm2, ymmword ptr [rax + 8*rsi]
	vmovdqu	ymm3, ymmword ptr [rax + 8*rsi + 32]
	vmovdqu	ymm4, ymmword ptr [rax + 8*rsi + 64]
	vmovdqu	ymm5, ymmword ptr [rax + 8*rsi + 96]
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β””β””β””
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚ @ broadcast.jl:672 within `_broadcast_getindex`
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ broadcast.jl:699 within `_broadcast_getindex_evalf`
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ number.jl:162 within `sign`
; β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ essentials.jl:797 within `ifelse`
	vpcmpgtq	ymm6, ymm0, ymm2
	vblendvpd	ymm2, ymm0, ymm2, ymm6
	vpcmpgtq	ymm6, ymm0, ymm3
	vblendvpd	ymm3, ymm0, ymm3, ymm6
	vpcmpgtq	ymm6, ymm0, ymm4
	vblendvpd	ymm4, ymm0, ymm4, ymm6
	vpcmpgtq	ymm6, ymm0, ymm5
	vblendvpd	ymm5, ymm0, ymm5, ymm6
	vpcmpgtq	ymm6, ymm1, ymm2
	vpor	ymm2, ymm6, ymm2
	vpcmpgtq	ymm6, ymm1, ymm3
	vpor	ymm3, ymm6, ymm3
	vpcmpgtq	ymm6, ymm1, ymm4
	vpor	ymm4, ymm6, ymm4
	vpcmpgtq	ymm6, ymm1, ymm5
	vpor	ymm5, ymm6, ymm5
; β”‚β”‚β”‚β”‚β””β””β””β””β””β””
; β”‚β”‚β”‚β”‚β”Œ @ array.jl:986 within `setindex!`
; β”‚β”‚β”‚β”‚β”‚β”Œ @ array.jl:991 within `_setindex!`
	vmovdqu	ymmword ptr [rcx + 8*rsi], ymm2
	vmovdqu	ymmword ptr [rcx + 8*rsi + 32], ymm3
	vmovdqu	ymmword ptr [rcx + 8*rsi + 64], ymm4
	vmovdqu	ymmword ptr [rcx + 8*rsi + 96], ymm5
; β”‚β”‚β”‚β”‚β””β””
; β”‚β”‚β”‚β”‚ @ simdloop.jl:78 within `macro expansion`
; β”‚β”‚β”‚β”‚β”Œ @ int.jl:87 within `+`
	add	rsi, 16
	cmp	rdx, rsi
	jne	.LBB0_18


I haven’t looked at the generated code, but I see slower speeds for all widths other than Int64.


Good point. The function needs to be something like _sign(x::T) where T = convert(T, x>0) - convert(T, x<0). The vectorized benchmark suggests that _sign wins over sign for Int64 and Int128, and loses for smaller sizes, on AVX2. Annoying that LLVM can’t figure that out!
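Spelled out as a standalone method (the name _signT is made up here), the type-stable version might look like:

```julia
# Hypothetical type-stable variant: convert the Bool comparisons back to T,
# so that e.g. _signT(Int8(-5)) returns an Int8 instead of a promoted Int.
_signT(x::T) where {T<:Integer} = convert(T, x > zero(T)) - convert(T, x < zero(T))
```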

Why wouldn’t you just use something like

sign2(x::Int64) = (x >> 63) | 1
sign2(x::Int32) = (x >> 31) | Int32(1)  # (or just | 1 if you want to return an Int64) 
sign2(x::Int16) = (x >> 15) | Int16(1)
julia> sign2(-123)
-1

julia> sign2(123)
1

julia> v = 1:1_000; w = similar(v); @btime map!(sign, $w, $v); @btime map!(sign2, $w, $v);
  139.677 ns (0 allocations: 0 bytes)
  89.876 ns (0 allocations: 0 bytes)

julia> v = Int32.(1:1_000); w = similar(v); @btime map!(sign, $w, $v); @btime map!(sign2, $w, $v);
  45.812 ns (0 allocations: 0 bytes)
  45.096 ns (0 allocations: 0 bytes)

julia> v = Int16.(1:1_000); w = similar(v); @btime map!(sign, $w, $v); @btime map!(sign2, $w, $v);
  24.649 ns (0 allocations: 0 bytes)
  24.574 ns (0 allocations: 0 bytes)

?

Edit: Oh, because sign2(0) == 1.
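For what it’s worth, the zero case can be patched while keeping the shift trick; this essentially reproduces by hand the ashr/or sequence LLVM emits for _sign (sign3 is a made-up name):

```julia
# Arithmetic shift gives -1 for negative x and 0 otherwise;
# OR-ing with (x != 0) fixes the positive case while keeping sign3(0) == 0.
sign3(x::Int64) = (x >> 63) | Int64(x != 0)
```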


For the record, if you click on the β€œShare” button on the top right corner you can generate a link for others to see the generated code: Compiler Explorer

I’m aware, but I checked godbolt on a device where I’m not logged into Discourse πŸ™‚ The link simply came from the automatic link creation that Discourse does when it sees a URL. I figured that these breadcrumbs were enough for someone else to do a deeper dive.

Interestingly, on my CPU (Zen 5), I see 2x worse performance for _sign. Presumably the difference is AVX-512:

julia> @btime map!(sign, $w, $v);
  25.732 ns (0 allocations: 0 bytes)

julia> @btime map!(_sign, $w, $v);
  43.862 ns (0 allocations: 0 bytes)

I think this thread is a good example of why the context behind β€œwhy am I optimizing” is important. What looks like a 13% speedup in a small case can be much worse in a larger one, or it may just be ~20 ns of overhead from the existing implementation making different use of the CPU’s vector units. Taken in isolation, changing sign like this may seem like a speedup, but in practice it is either unlikely to matter or is outright slower than the existing approach, depending on the execution environment. I think it’s pretty unlikely that a mapped sign is going to be the bottleneck in an algorithm, so changing this in Base (even if we only look at the favorable results) is probably not going to lead to real speedups.


In my mind, the main thing this shows is room for improvement in LLVM more than in Julia. Trying to manually play the game of shaving cycles on modern CPUs is somewhat of a fool’s errand. If LLVM has a good understanding of what the options are and what the costs are, it will do a good job; if not, it won’t.


The status quo is actually 45% slower on my machine vs. your 13%, so please make a PR, perhaps with this instead:

julia> _sign(x::Integer) = (x > zero(x)) - (x < zero(x))  # Better, still would return wrong type for Rationals. I get same code for unsigned.

julia> _sign(NaN)  # wrong (when floats/reals are allowed); should be NaN, though it works for +Inf and -Inf
0

julia> @code_native _sign(-1//2)  # Much shorter, but I intentionally didn't use Real type, since doesn't work for floats.

The code in Base is actually branchless too, but with more tests/cmovs.

Would that likely be slower? I.e., could the current cmovs work on a full SIMD register?