Faster `div`, `rem` and `mod` for small integers?

I’ve played around with div, rem and mod for signed bit integers. I first noticed that these functions are faster for Int32 than for Int64 on amd64 processors. (This becomes clear if you look at the processor instruction set.) Then I tried smaller integers. On several processors (all amd64), the following definitions are faster than the standard functions in Julia:

Base.div(x::Int16, y::Int16) = div(x % Int32, y % Int32) % Int16
Base.rem(x::Int16, y::Int16) = rem(x % Int32, y % Int32) % Int16
Base.mod(x::Int8, y::Int8) = (z = rem(x, y); ifelse(iszero(z) || signbit(z) == signbit(y), z, z+y))

With the benchmark code

using Chairmarks
M = 1000
for T in (Int32, Int16, Int8)
    @show T
    p = rand(T, M)
    q = map(_ -> (y = rand(T); y in (0, -1) ? T(9) : y), 1:M)
    display(@b similar(q) map!(div, _, $p, $q), map!(rem, _, $p, $q), map!(mod, _, $p, $q))
end

I get for the standard functions the timings

T div rem mod
Int32 3.639 μs 3.080 μs 3.921 μs
Int16 3.919 μs 7.016 μs 8.913 μs
Int8 3.372 μs 3.090 μs 9.828 μs

and for the new ones

T div rem mod
Int32 3.637 μs 3.082 μs 3.921 μs
Int16 2.803 μs 3.081 μs 4.475 μs
Int8 3.361 μs 3.081 μs 4.300 μs

I’d be curious to know what other people observe, both for amd64 processors and for other architectures. Thanks in advance for posting your benchmarks below!

I can reproduce this on Zen5. Looking at the code, the difference in assembly is very small. It’s simply the difference between

	mov	eax, edi
	cwd
	idiv	si
	mov	eax, edx
                                        # kill: def $ax killed $ax killed $eax
	ret

for Base.rem vs

	movsx	eax, di
; ││ @ int.jl:296 within `rem`
	cdq
	idiv	esi
	mov	eax, edx
; ││ @ int.jl:544 within `rem`
                                        # kill: def $ax killed $ax killed $eax
	ret

for your rem. It is very unclear to me why this makes a difference.

1 Like

EDIT: Forget this, it’s wrong, at least for Int32.

On a related note, the standard definition for rem(x, y) checks if y == -1:

; Function Signature: rem(Int32, Int32)
;  @ int.jl:296 within `rem`
define i32 @julia_rem_1412(i32 signext %"x::Int32", i32 signext %"y::Int32") #0 {
top:
  switch i32 %"y::Int32", label %oksrem [
    i32 0, label %fail
    i32 -1, label %after_srem
  ]

[...]

oksrem:                                           ; preds = %top
  %0 = srem i32 %"x::Int32", %"y::Int32"
  br label %after_srem

after_srem:                                       ; preds = %oksrem, %top
  %1 = phi i32 [ %0, %oksrem ], [ 0, %top ]
  ret i32 %1
}

I think this is superfluous because the only bad argument is y == 0. Once this case has been ruled out, one can safely call LLVM’s srem.

I believe we do this to avoid issues with rem(typemin(Int), -1)

Sorry, I was wrong – srem is translated to idiv on amd64, which does not allow the combination of typemin and -1.