Faster `div`, `rem` and `mod` for small integers?

matthias314 · June 27, 2025, 12:03am

I’ve played around with div, rem and mod for signed bit integers. I first noticed that these functions are faster for Int32 than for Int64 on amd64 processors. (This becomes clear if you look at the processor instruction set.) Then I tried smaller integers. On several processors (all amd64), the following definitions are faster than the standard functions in Julia:

Base.div(x::Int16, y::Int16) = div(x % Int32, y % Int32) % Int16
Base.rem(x::Int16, y::Int16) = rem(x % Int32, y % Int32) % Int16
Base.mod(x::Int8, y::Int8) = (z = rem(x, y); ifelse(iszero(z) || signbit(z) == signbit(y), z, z+y))

With the benchmark code

using Chairmarks
M = 1000
for T in (Int32, Int16, Int8)
    @show T
    p = rand(T, M)
    q = map(_ -> (y = rand(T); y in (0, -1) ? T(9) : y), 1:M)
    display(@b similar(q) map!(div, _, $p, $q), map!(rem, _, $p, $q), map!(mod, _, $p, $q))
end

I get for the standard functions the timings

`T`	`div`	`rem`	`mod`
`Int32`	3.639 μs	3.080 μs	3.921 μs
`Int16`	3.919 μs	7.016 μs	8.913 μs
`Int8`	3.372 μs	3.090 μs	9.828 μs

and for the new ones

`T`	`div`	`rem`	`mod`
`Int32`	3.637 μs	3.082 μs	3.921 μs
`Int16`	2.803 μs	3.081 μs	4.475 μs
`Int8`	3.361 μs	3.081 μs	4.300 μs

I’d be curious to know what other people observe, both for amd64 processors and for other architectures. Thanks in advance for posting your benchmarks below!

Oscar_Smith · June 27, 2025, 12:47am

I can reproduce this on Zen5. Looking at the code, the difference in assembly is very small. It’s simply the difference between

	mov	eax, edi
	cwd
	idiv	si
	mov	eax, edx
                                        # kill: def $ax killed $ax killed $eax
	ret

for Base.rem vs

	movsx	eax, di
; ││ @ int.jl:296 within `rem`
	cdq
	idiv	esi
	mov	eax, edx
; ││ @ int.jl:544 within `rem`
                                        # kill: def $ax killed $ax killed $eax
	ret

for your rem. It is very unclear to me why this makes a difference.

matthias314 · June 27, 2025, 1:12am

EDIT: Forget this, it’s wrong, at least for Int32.

On a related note, the standard definition for rem(x, y) checks if y == -1:

; Function Signature: rem(Int32, Int32)
;  @ int.jl:296 within `rem`
define i32 @julia_rem_1412(i32 signext %"x::Int32", i32 signext %"y::Int32") #0 {
top:
  switch i32 %"y::Int32", label %oksrem [
    i32 0, label %fail
    i32 -1, label %after_srem
  ]

[...]

oksrem:                                           ; preds = %top
  %0 = srem i32 %"x::Int32", %"y::Int32"
  br label %after_srem

after_srem:                                       ; preds = %oksrem, %top
  %1 = phi i32 [ %0, %oksrem ], [ 0, %top ]
  ret i32 %1
}

~~I think this is superfluous because the only bad argument is y == 0. Once this case has been ruled out, one can safely call LLVM’s srem.~~

Oscar_Smith · June 27, 2025, 1:38am

I believe we do this to avoid issues with rem(typemin(Int), -1)

matthias314 · June 27, 2025, 1:55am

Sorry, I was wrong – srem is translated to idiv on amd64, which does not allow the combination of typemin and -1.

Topic		Replies	Views
What is faster? Modular Arithmetic functions and operations Performance question	20	2275	December 4, 2020
Optimizing remainder with constant modulus implementation Performance	2	526	April 6, 2021
Possibly mildly faster sign for integers? Performance	12	366	April 15, 2025
Faster `findmin` without LoopVectorization.jl Performance question	35	905	May 14, 2025
DarkIntegers, a pure Julia library for modulo integer and polynomial arithmetic Package Announcements package , announcement	4	790	March 9, 2020

Faster `div`, `rem` and `mod` for small integers?

Related topics