I’ve played around with div, rem and mod for signed bit integers. I first noticed that these functions are faster for Int32 than for Int64 on amd64 processors. (This becomes clear if you look at the processor instruction set.) Then I tried smaller integers. On several processors (all amd64), the following definitions are faster than the standard functions in Julia:
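# Widen to Int32, divide there, then truncate the result back to Int16.
Base.div(x::Int16, y::Int16) = div(x % Int32, y % Int32) % Int16
Base.rem(x::Int16, y::Int16) = rem(x % Int32, y % Int32) % Int16
# Build mod from rem: add y when the remainder is nonzero and its sign differs from y's.
Base.mod(x::Int8, y::Int8) = (z = rem(x, y); ifelse(iszero(z) || signbit(z) == signbit(y), z, z+y))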
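Before trusting the timings, it seems worth checking that the new definitions return the same values as the standard ones. The following is only a minimal sketch: the names div_wide, rem_wide, mod_rem, check_int8 and check_int16 are hypothetical, chosen so that the Base methods are not overwritten and both versions can be compared side by side. Divisors 0 and -1 are skipped, as in the benchmark, and the Int16 check only samples every 251st divisor to keep the run short.

```julia
# Hypothetical names: same bodies as the definitions above, but kept
# separate from Base so the standard methods stay available for comparison.
div_wide(x::Int16, y::Int16) = div(x % Int32, y % Int32) % Int16
rem_wide(x::Int16, y::Int16) = rem(x % Int32, y % Int32) % Int16
mod_rem(x::Int8, y::Int8) = (z = rem(x, y); ifelse(iszero(z) || signbit(z) == signbit(y), z, z + y))

# Compare mod_rem with Base.mod for every Int8 pair (divisors 0 and -1 skipped).
function check_int8()
    for x in -128:127, y in -128:127
        y in (0, -1) && continue
        a, b = Int8(x), Int8(y)
        mod_rem(a, b) == mod(a, b) || return false
    end
    return true
end

# Compare div_wide and rem_wide with Base for every Int16 dividend
# and a strided sample of divisors (divisors 0 and -1 skipped).
function check_int16()
    for x in -32768:32767, y in -32768:251:32767
        y in (0, -1) && continue
        a, b = Int16(x), Int16(y)
        (div_wide(a, b) == div(a, b) && rem_wide(a, b) == rem(a, b)) || return false
    end
    return true
end

@show check_int8() check_int16()
```

This says nothing about the skipped divisors, so those cases would need a separate look.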
With the benchmark code
using Chairmarks
M = 1000
for T in (Int32, Int16, Int8)
    @show T
    p = rand(T, M)
    q = map(_ -> (y = rand(T); y in (0, -1) ? T(9) : y), 1:M)
    display(@b similar(q) map!(div, _, $p, $q), map!(rem, _, $p, $q), map!(mod, _, $p, $q))
end
I get the following timings for the standard functions:
T | div | rem | mod |
---|---|---|---|
Int32 | 3.639 μs | 3.080 μs | 3.921 μs |
Int16 | 3.919 μs | 7.016 μs | 8.913 μs |
Int8 | 3.372 μs | 3.090 μs | 9.828 μs |
and for the new definitions:
T | div | rem | mod |
---|---|---|---|
Int32 | 3.637 μs | 3.082 μs | 3.921 μs |
Int16 | 2.803 μs | 3.081 μs | 4.475 μs |
Int8 | 3.361 μs | 3.081 μs | 4.300 μs |
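To make the comparison easier to read, the ratio of old to new timing can be computed for the entries that changed noticeably. This is just a small sketch: the numbers are copied from the two tables above, and the names rows and speedup are hypothetical.

```julia
# Timings in μs copied from the two tables above (old = standard, new = redefined).
rows = [
    (T = Int16, f = :div, old = 3.919, new = 2.803),
    (T = Int16, f = :rem, old = 7.016, new = 3.081),
    (T = Int16, f = :mod, old = 8.913, new = 4.475),
    (T = Int8,  f = :mod, old = 9.828, new = 4.300),
]

# Hypothetical helper: how many times faster the new definition ran.
speedup(r) = r.old / r.new

for r in rows
    println(r.f, "(", r.T, "): ", round(speedup(r); digits = 2), "x")
end
```

On these numbers the ratios lie between roughly 1.4 and 2.3, with the largest gains for rem on Int16 and mod on Int8.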
I’d be curious to know what other people observe, both for amd64 processors and for other architectures. Thanks in advance for posting your benchmarks below!