Optimizing `powermod`

Here’s how you can implement powermod from scratch in Julia and achieve better performance than the built-in version. (Maybe I need to add more safety checks?)

function my_powermod(a::Int64, b::Int64, c::Int64)
    @assert b >= 0
    result = one(Int128)                           # accumulate in Int128 so products can't overflow
    a_raised_to_powers_of_two = convert(Int128, a) # holds a^(2^k) mod c for the current bit k
    while b != 0
        if isodd(b)                                # this bit of the exponent contributes a factor
            result *= a_raised_to_powers_of_two
            result = result % c
        end
        b = b >>> 1
        a_raised_to_powers_of_two = a_raised_to_powers_of_two^2 % c
    end
    convert(Int64, result)                         # the reduced result fits back into Int64
end

Test:

julia> @btime my_powermod(3,100000000000,102)
  583.514 ns (0 allocations: 0 bytes)
69

julia> @btime powermod(3,100000000000,102)
  747.575 ns (0 allocations: 0 bytes)
69

What performance do you get if you write it in pure Python?

2 Likes

I noted in "Computing the hexadecimal value of pi" that PiBBP.jl/src/bbp.jl (at d8950b372586d9a5d01538bfd653bd6415ce64f2, kalvotom/PiBBP.jl on GitHub) is a faster implementation than Julia’s built-in, but going by memory (it’s been a few years since I looked at it) it doesn’t cover some corner cases (maybe negative numbers?).

1 Like

Your GitHub link seems to use the same algorithm as what I came up with, though I use Int128 for intermediate results to prevent overflows in corner cases. (If I allowed generic integer types for the arguments, I would use widen instead; see the sketch below.)
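
For illustration, a generic variant might look something like this (a sketch only, not the PiBBP.jl code; widen picks the next wider integer type, so Int64 intermediates become Int128 while BigInt stays BigInt):

function my_powermod_generic(a::T, b::Integer, c::T) where {T<:Integer}
    @assert b >= 0
    result = one(widen(T))          # intermediates live in the widened type
    base = widen(a) % c
    while b != 0
        if isodd(b)
            result = result * base % c
        end
        b >>= 1
        base = base * base % c
    end
    convert(T, result)              # the reduced value fits back into T
end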

1 Like

If the modulus is a prime number, the fastest version I found is from Nemo.jl, which calls the C library FLINT.

julia> using Nemo, BenchmarkTools

julia> const a=GF(103)(3) # everything mod 103

julia> @btime a^100000000000
  219.446 ns (1 allocation: 80 bytes)
76

julia> @btime powermod(3, 100000000000, 103) # built-in
  747.658 ns (0 allocations: 0 bytes)
76

julia> @btime power_mod(3, 100000000000, 103) # from PiBBP.jl
  275.689 ns (0 allocations: 0 bytes)
76

(Julia 1.10.4, Nemo 0.43.2)

2 Likes

How on earth are they getting that much faster? Are they taking advantage of the smaller modulus to do the reduction less often or something?

System:
Kernel: 5.15.0-76-generic x86_64 bits: 64 compiler: gcc v: 11.3.0 Desktop: Xfce 4.18.1
tk: Gtk 3.24.33 wm: xfwm dm: LightDM Distro: Linux Mint 21.2 Victoria base: Ubuntu 22.04 jammy
Machine:
Type: Laptop System: LENOVO product: 20DNS01100 v: ThinkPad S3 Yoga 14
CPU:
Info: dual core model: Intel Core i5-5200U bits: 64 type: MT MCP arch: Broadwell rev: 4 cache:
L1: 128 KiB L2: 512 KiB L3: 3 MiB
Speed (MHz): avg: 2475 high: 2598 min/max: 500/2700 cores: 1: 2598 2: 2492 3: 2349 4: 2464
bogomips: 17559
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3

~ $ python3.13 -m IPython
Python 3.13.0b1 (main, May 10 2024, 12:32:10) [GCC 11.4.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.25.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: def my_powermod(a, b, c) :
   ...:     assert b >= 0
   ...:     result = 1
   ...:     a_raised_to_powers_of_two = a
   ...:     while b  !=  0 :
   ...:         if b%2 :
   ...:             result *= a_raised_to_powers_of_two
   ...:             result = result % c;
   ...:         b = b >> 1
   ...:         a_raised_to_powers_of_two = a_raised_to_powers_of_two**2 % c
   ...:     return result
   ...: 
In [2]: pow(3,100000000000,102)
Out[2]: 69

In [3]: my_powermod(3,100000000000,102)
Out[3]: 69

In [4]: %timeit pow(3,100000000000,102)
1.42 μs ± 44.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [5]: %timeit my_powermod(3,100000000000,102)
5.79 μs ± 92.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
~ $ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.4 (2024-06-04)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> function my_powermod(a::Int64, b::Int64, c::Int64)
           @assert b >= 0
           result = one(Int128)
           a_raised_to_powers_of_two = convert(Int128, a)
           while b != 0
               if isodd(b)
                   result *= a_raised_to_powers_of_two
                   result = result % c;
               end
               b = b >>> 1
               a_raised_to_powers_of_two = a_raised_to_powers_of_two^2 % c
           end
           convert(Int64, result)
       end
my_powermod (generic function with 1 method)
julia> using BenchmarkTools

julia> @btime my_powermod(3,100000000000,102)
  868.258 ns (0 allocations: 0 bytes)
69

julia> @btime powermod(3,100000000000,102)
  1.108 μs (0 allocations: 0 bytes)
69

julia> powermod(3,100000000000,102)
69

julia> my_powermod(3,100000000000,102)
69
julia> using Nemo, BenchmarkTools

julia> const a=GF(103)(3)
3

julia> @btime a^100000000000
  346.069 ns (1 allocation: 80 bytes)
76

julia> @btime powermod(3, 100000000000, 103)
  1.112 μs (0 allocations: 0 bytes)
76

My guess is that they make use of

3^{102} \equiv 1 \pmod{103}

to compute it as

powermod(3, mod(100000000000, 102), 103)
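
A quick check that the reduced exponent gives the same residue (valid here because 103 is prime and gcd(3, 103) = 1, so Fermat’s little theorem applies):

julia> mod(100000000000, 103 - 1)   # exponent reduced mod 102
88

julia> powermod(3, 88, 103)         # matches the full-exponent result above
76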
1 Like

oh yeah, we should be doing that probably.

The Project Euler classic!

If the same modulus is used in a large number of computations, you can make the modulus a compile-time constant (using value types), and LLVM will optimize the expensive integer division instructions into cheaper multiplication and bit-shift instructions:

power_mod2(a, b, c) = power_mod_fixed_c(a, b, Val(c))
@fastmath function power_mod_fixed_c(a, b, ::Val{c}) where c
    @assert c^2 < typemax(c) #otherwise need `widen`
    r = 1
    while b > 0
        if isodd(b)
            r = r * a % c
        end
        b = div(b, 2)
        a = a * a % c
    end
    return r
end

Running power_mod2(3, 100000000000, 102) is twice as fast as the original version, though the time for JIT compilation (which this trick relies on, since each new modulus triggers a fresh specialization) is not included. With Nemo.jl, when you construct the prime field GF(103), I think some numbers are pre-computed to enable similar tricks without invoking a JIT compiler, and that pre-computation was also not included in the benchmark numbers I posted.

Note you can use Base.MultiplicativeInverses.multiplicativeinverse(x) if you’d prefer to avoid the dynamic dispatch.

julia> @btime power_mod2($(Ref(3))[], $(Ref(100000000000))[], $(Ref(102))[])
  200.797 ns (1 allocation: 16 bytes)
69

julia> @btime power_mod3($(Ref(3))[], $(Ref(100000000000))[], $(Ref(102))[])
  135.233 ns (0 allocations: 0 bytes)
69

Implementation:

function power_mod3(a, b, c)
    @assert c^2 < typemax(c) #otherwise need `widen`
    cfast = Base.MultiplicativeInverses.multiplicativeinverse(c)
    r = 1
    while b > 0
        if isodd(b)
            r = r * a % cfast
        end
        b = div(b, 2)
        a = a * a % cfast
    end
    return r
end

@oscar_smith maybe we should be doing this?

It might also be faster to use trailing_zeros, as in the power_by_squaring method in Base, which helps if the exponent has a lot of 0 bits (i.e., if isodd would return false multiple times in a row). It probably won’t make much of a difference in the average case. A sketch of the idea is below.
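
Roughly following the structure of Base.power_by_squaring, the idea might look like this (a sketch; assumes b >= 0 and that c^2 fits in the integer type, as in the versions above):

function power_mod_tz(a, b, c)
    @assert b >= 0
    a = mod(a, c)                    # keep the base reduced so products can't overflow
    b == 0 && return mod(one(a), c)
    t = trailing_zeros(b) + 1        # skip the low run of zero bits in one step
    b >>= t
    while (t -= 1) > 0
        a = a * a % c
    end
    r = a
    while b > 0
        t = trailing_zeros(b) + 1
        b >>= t
        while (t -= 1) >= 0          # square once per skipped bit, plus once for the set bit
            a = a * a % c
        end
        r = r * a % c
    end
    return r
end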

2 Likes

This will be a bit tricky, since in general you can reduce the exponent in a^e \bmod n only by \varphi(n), with \varphi the Euler \varphi-function (and the reduction requires \gcd(a, n) = 1, by Euler's theorem). If n happens to be a prime (which it is in this case), you have \varphi(n) = n - 1. In general you need to factor n to find \varphi(n).
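
A small illustration of the composite case (this assumes the Primes.jl package for totient; the check holds because gcd(7, n) = 1):

julia> using Primes                        # provides totient (Euler's φ)

julia> n = 101 * 103                       # composite, so finding φ(n) requires factoring n
10403

julia> phi = totient(n)                    # φ(101 * 103) = 100 * 102
10200

julia> powermod(7, 10^11, n) == powermod(7, mod(10^11, phi), n)   # valid since gcd(7, n) == 1
true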