# Optimizing powermod

Here’s how you can implement powermod from scratch in Julia and achieve better performance than the built-in version. (Maybe I need to add more safety checks?)

function my_powermod(a::Int64, b::Int64, c::Int64)
@assert b >= 0
result = one(Int128)
a_raised_to_powers_of_two = convert(Int128, a)
while b != 0
if isodd(b)
result *= a_raised_to_powers_of_two
result = result % c;
end
b = b >>> 1
a_raised_to_powers_of_two = a_raised_to_powers_of_two^2 % c
end
convert(Int64, result)
end


Test:

julia> @btime my_powermod(3,100000000000,102)
583.514 ns (0 allocations: 0 bytes)
69

julia> @btime powermod(3,100000000000,102)
747.575 ns (0 allocations: 0 bytes)
69


What performance do you get if if you write it in pure Python?

2 Likes

I noted in Computing the hexadecimal value of pi that PiBBP.jl/src/bbp.jl at d8950b372586d9a5d01538bfd653bd6415ce64f2 · kalvotom/PiBBP.jl · GitHub is a faster implementation than Julia’s built-in, but going by memory (it’s been a few years since I looked at it) it doesn’t cover some corner cases (maybe negative numbers?)

1 Like

Your github link seems to use the same algorithm as what I came up with, though I use Int128 for intermediate results to prevent overflows in corner cases. (If I allowed generic integer types for the arguments, then I would use widen.)

1 Like

If the modulus is a prime number, the fastest version I found is from Nemo.jl, which calls the C library FLINT.

julia> using Nemo, BenchmarkTools

julia> const a=GF(103)(3) # everything mod 103

julia> @btime a^100000000000
219.446 ns (1 allocation: 80 bytes)
76

julia> @btime powermod(3, 100000000000, 103) # built-in
747.658 ns (0 allocations: 0 bytes)
76

julia> @btime power_mod(3, 100000000000, 103) # from PiBBP.jl
275.689 ns (0 allocations: 0 bytes)
76


(Julia 1.10.4, Nemo 0.43.2)

2 Likes

How on earth are they getting that much faster? Are they taking advantage of the smaller modulus to do the reduction less often or something?

System:
Kernel: 5.15.0-76-generic x86_64 bits: 64 compiler: gcc v: 11.3.0 Desktop: Xfce 4.18.1
tk: Gtk 3.24.33 wm: xfwm dm: LightDM Distro: Linux Mint 21.2 Victoria base: Ubuntu 22.04 jammy
Machine:
Type: Laptop System: LENOVO product: 20DNS01100 v: ThinkPad S3 Yoga 14
CPU:
Info: dual core model: Intel Core i5-5200U bits: 64 type: MT MCP arch: Broadwell rev: 4 cache:
L1: 128 KiB L2: 512 KiB L3: 3 MiB
Speed (MHz): avg: 2475 high: 2598 min/max: 500/2700 cores: 1: 2598 2: 2492 3: 2349 4: 2464
bogomips: 17559
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3

~ $python3.13 -m IPython Python 3.13.0b1 (main, May 10 2024, 12:32:10) [GCC 11.4.0] Type 'copyright', 'credits' or 'license' for more information IPython 8.25.0 -- An enhanced Interactive Python. Type '?' for help. In [1]: def my_powermod(a, b, c) : ...: assert b >= 0 ...: result = 1 ...: a_raised_to_powers_of_two = a ...: while b != 0 : ...: if b%2 : ...: result *= a_raised_to_powers_of_two ...: result = result % c; ...: b = b >> 1 ...: a_raised_to_powers_of_two = a_raised_to_powers_of_two**2 % c ...: return result ...:  In [2]: pow(3,100000000000,102) Out[2]: 69 In [3]: my_powermod(3,100000000000,102) Out[3]: 69 In [4]: %timeit pow(3,100000000000,102) 1.42 μs ± 44.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each) In [5]: %timeit my_powermod(3,100000000000,102) 5.79 μs ± 92.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)  ~$ julia
_
_       _ _(_)_     |  Documentation: https://docs.julialang.org
(_)     | (_) (_)    |
_ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
| | | | | | |/ _ |  |
| | |_| | | | (_| |  |  Version 1.10.4 (2024-06-04)
_/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> function my_powermod(a::Int64, b::Int64, c::Int64)
@assert b >= 0
result = one(Int128)
a_raised_to_powers_of_two = convert(Int128, a)
while b != 0
if isodd(b)
result *= a_raised_to_powers_of_two
result = result % c;
end
b = b >>> 1
a_raised_to_powers_of_two = a_raised_to_powers_of_two^2 % c
end
convert(Int64, result)
end
my_powermod (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime my_powermod(3,100000000000,102)
868.258 ns (0 allocations: 0 bytes)
69

julia> @btime powermod(3,100000000000,102)
1.108 μs (0 allocations: 0 bytes)
69

julia> powermod(3,100000000000,102)
69

julia> my_powermod(3,100000000000,102)
69

julia> using Nemo, BenchmarkTools

julia> const a=GF(103)(3)
3

julia> @btime a^100000000000
346.069 ns (1 allocation: 80 bytes)
76

julia> @btime powermod(3, 100000000000, 103)
1.112 μs (0 allocations: 0 bytes)
76


My guess is that they make use of

3^102 === 1 mod 103


to compute it as

powermod(3, mod(100000000000, 102), 103)

1 Like

oh yeah, we should be doing that probably.

The Project Euler classic!

If the same modulus is used in a large number of computations, you can make the modulus a compile-time constant (using value types), and LLVM will optimize the expensive integer division instructions into cheaper multiplication and bit-shift instructions:

power_mod2(a, b, c) = power_mod_fixed_c(a, b, Val(c))
@fastmath function power_mod_fixed_c(a, b, ::Val{c}) where c
@assert c^2 < typemax(c) #otherwise need widen
r = 1
while b > 0
if isodd(b)
r = r * a % c
end
b = div(b, 2)
a = a * a % c
end
return r
end


Running power_mod2(3, 100000000000, 102) is twice as fast as the original version, though the time for JIT compile (used in an essential way here) is not included. With Nemo.jl, when you construct a prime field GF(103), I think some numbers are pre-computed to enable similar tricks without invoking a JIT compiler, and this pre-computation was also not included in the benchmark numbers I posted.

Note you can use Base.MultiplicativeInverses.multiplicativeinverse(x) if you’d prefer to to avoid the dynamic dispatch.

julia> @btime power_mod2($(Ref(3))[],$(Ref(100000000000))[], $(Ref(102))[]) 200.797 ns (1 allocation: 16 bytes) 69 julia> @btime power_mod3($(Ref(3))[], $(Ref(100000000000))[],$(Ref(102))[])
135.233 ns (0 allocations: 0 bytes)
69


Implementation:

function power_mod3(a, b, c)
@assert c^2 < typemax(c) #otherwise need widen
cfast = Base.MultiplicativeInverses.multiplicativeinverse(c)
r = 1
while b > 0
if isodd(b)
r = r * a % cfast
end
b = div(b, 2)
a = a * a % cfast
end
return r
end


@oscar_smith maybe we should be doing this?

It might also be faster to use trailing_zeros like the power_by_squaring method, which might be faster if you have a lot of 0s (i.e., if isodd would return false` multiple times in a row). Probably won’t make much of a difference in the average case.

2 Likes

This will be a bit tricky, since in general you can reduce the exponent in a^e \bmod n only by \varphi(n), with \varphi the Euler \varphi-function. If n happens to be a prime (which it is in this case), you have \varphi(n) = n - 1. In general you need to factor n to find \varphi(n).