Making Jacobi symbol function faster

After wrapping the amazingly fast OpenSSL for group operations (see OpenSSLGroups.jl), I found a new bottleneck in proof-of-shuffle verification in ShuffleProofs.jl. My code now spends half of its time computing a Jacobi symbol in CryptoUtils.jl:

function jacobi(n::Integer, k::Integer)
    rem(k, 2) != zero(k) || throw("Argument k=$k should be odd.")

    n = mod(n, k)
    t = 1
    while n != zero(n)
        while rem(n, 2) == 0
            n = div(n, 2)
            r = k % 8
            if r == 3 || r == 5
                t = -t
            end
        end
        n, k = k, n
        if n % 4 == 3 && k % 4 == 3
            t = -t
        end
        n = mod(n, k)
    end
    if k == oneunit(k)
        return t
    else
        return 0
    end
end

@code_warntype did not show any instabilities. Is there anything that could make this function faster?
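
For reference, here is a quick sanity check that any variant should pass (the expected values are just the textbook Jacobi symbols):

# Sanity check for any jacobi variant; expected values are textbook Jacobi symbols.
@assert jacobi(1, 3) == 1     # (1/3) = 1
@assert jacobi(2, 5) == -1    # 2 is not a quadratic residue mod 5
@assert jacobi(7, 9) == 1     # (7/9) = (7/3)^2 = 1
@assert jacobi(2, 15) == 1    # (2/15) = (2/3)*(2/5) = (-1)*(-1) = 1
@assert jacobi(0, 3) == 0     # argument shares a factor with the modulus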


Probably worth giving https://eprint.iacr.org/2024/1054.pdf a read.


Specifically, this seems to be about 10% faster for Int:

function jacobi2(n::Integer, k::Integer)
    iseven(k) && throw("Argument k=$k should be odd.")
    n = mod(n, k)
    t = 1
    k_1 = isodd(k)
    while !iszero(n)
        z = trailing_zeros(n)
        n >>= z
        n_1 = isodd(n)
        if isodd(z) && (k_1 ⊻ (k&2 == 2)) ⊻ (k_1 && n_1)
            t = -t
        end
        k_1 = n_1
        n, k = k, n
        n = mod(n, k)
    end
    return isone(k) ? t : 0
end

Oh, and for BigInt it’s ~4x faster:

julia> a,b=(rand(1:big(2)^250, 2).*2 .+1)
2-element Vector{BigInt}:
  847773254239334728808956574554461732293015522084940209610487393836041804047
 3461986191583086034964019643491355757375917644513356125044555068927659932347

julia> @btime jacobi(a,b);
  52.511 μs (2706 allocations: 52.09 KiB)

julia> @btime jacobi2(a,b);
  18.216 μs (792 allocations: 15.11 KiB)

This looks good. I also found that I can get around a 10x speedup with BitIntegers.jl and the original jacobi method. For some reason, the jacobi2 method with Int256 is slower than the original jacobi method.

The BitIntegers were tricky to exploit in practice, as they happened to be slower for the sqrt_mod_prime function, which I also needed. My solution was to call is_quadratic_residue(U(y), U(p)), where the type U is selected at runtime, while keeping the rest of the code in BigInt as before. I also gained a significant boost by removing redundant quadratic residue checks in sqrt_mod_prime. With these changes, the Jacobi symbol calculation now takes only 1/20th of the total verification time. A significant success :partying_face:
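
A rough sketch of that setup (the width-selection helper and its thresholds below are illustrative, not the actual code; is_quadratic_residue is the function referred to above):

using BitIntegers   # exports UInt256, UInt512, UInt1024, ...

# Illustrative helper: pick a fixed-width unsigned type wide enough for the
# modulus, falling back to BigInt when it is too large.
function uint_type(p::BigInt)
    nbits = ndigits(p, base = 2)
    nbits <= 256  && return UInt256
    nbits <= 512  && return UInt512
    nbits <= 1024 && return UInt1024
    return BigInt
end

# The surrounding code keeps working in BigInt; only the residue test
# converts to the narrower type.
function is_quadratic_residue_fast(y::BigInt, p::BigInt)
    U = uint_type(p)
    return is_quadratic_residue(U(y), U(p))
end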


Nice! What is the sqrt_mod_prime function? My guess is that BitIntegers.jl is probably missing some specializations that would make it way faster in general (possibly their iseven or something is bad?)

sqrt_mod_prime relies on the powermod function, which is much faster with the BigInt type.
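
For context, a minimal sketch of why powermod dominates: in the easiest branch, p ≡ 3 (mod 4), a modular square root is essentially one powermod call (this is an illustration with a made-up name, not the CryptoUtils.jl implementation):

# Illustration only: modular square root for the special case p ≡ 3 (mod 4).
# The general case (e.g. Tonelli–Shanks) needs several powermod calls, so
# powermod performance dominates either way.
function sqrt_mod_p3mod4(a::BigInt, p::BigInt)
    @assert p % 4 == 3
    r = powermod(a, (p + 1) >> 2, p)    # candidate root a^((p+1)/4) mod p
    powermod(r, 2, p) == mod(a, p) || error("$a is not a quadratic residue mod $p")
    return r
end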

Yeah, it appears LLVM’s intrinsics are pretty slow once the size gets to 512 bits or so. That’s a shame.

For BigInt I think something like !iszero((k % Int) & 2) should be faster and allocate less. Just k & 2 alone allocates two BigInts.

Maybe mutating BigInt values would be worth considering, too, possibly via MutableArithmetics.jl.


I came across another instance where the Jacobi symbol was a bottleneck, but now with numbers in the 2048-bit range. (It is used for checking membership of quadratic residue prime groups where p = 2*q + 1, as a faster alternative to testing powermod(x, q, p) == 1.) At this size, converting to BitIntegers was significantly slower than staying with BigInt.
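
For context, here is a sketch of that membership test with a safe prime p = 2q + 1 (the function names are illustrative, not from any package):

# Membership test for the quadratic residue subgroup of Z_p^*, where p = 2q + 1.
# For a safe prime the order-q subgroup is exactly the set of quadratic residues,
# so a Jacobi (here Legendre) symbol of 1 is equivalent to passing the powermod test.
is_member_powermod(x, q, p) = powermod(x, q, p) == 1   # direct but slow
is_member_jacobi(x, p)      = jacobi(x, p) == 1        # faster alternative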

Profiling the code revealed that it all comes down to mod performance, which becomes significantly slower for numbers beyond the 512-bit range:

using BenchmarkTools
using BitIntegers

BitIntegers.@define_integers 2048 Int2048 UInt2048
BitIntegers.@define_integers 4096 Int4096 UInt4096

for U in [UInt, UInt128, UInt256, UInt512, UInt1024, UInt2048, UInt4096]

    r = rand(U)
    p = rand(U)

    println("$(nameof(U)):")
    @btime mod($r, $p)
    @btime mod($(BigInt(r)), $(BigInt(p)))

    println("")

end

which produces:

UInt64:
  2.125 ns (0 allocations: 0 bytes)
  50.953 ns (2 allocations: 40 bytes)

UInt128:
  4.125 ns (0 allocations: 0 bytes)
  71.124 ns (3 allocations: 48 bytes)

UInt256:
  34.449 ns (0 allocations: 0 bytes)
  111.948 ns (3 allocations: 64 bytes)

UInt512:
  36.043 ns (0 allocations: 0 bytes)
  80.191 ns (3 allocations: 96 bytes)

UInt1024:
  153.629 ns (0 allocations: 0 bytes)
  129.540 ns (3 allocations: 160 bytes)

UInt2048:
  357.338 ns (0 allocations: 0 bytes)
  124.687 ns (3 allocations: 288 bytes)

UInt4096:
  1.475 μs (0 allocations: 0 bytes)
  165.843 ns (3 allocations: 544 bytes)

This made the jacobi2 method quite useful to have :slight_smile:

I noticed that the performance step between UInt256 and UInt512 is really small. It would be exceptional to have a UInt1024 that evaluates at 40 ns and a UInt2048 at 50 ns to continue the trend :nerd_face:

The problem is that once the numbers get bigger, you start needing the fancy algorithms.

The jacobi2 function produced errors in my tests:

Failed: jacobi(2, 5) = 1, expected -1
Failed: jacobi(7, 9) = -1, expected 1

which I did not know how to get the AI to fix for me. I managed to get Claude to spit out the following code for BigInt:

function jacobi(n::BigInt, k::BigInt)

    # Get the i-th limb of a BigInt (i = 0 is the least-significant limb)
    get_limb(n::BigInt, i::Int) = unsafe_load(n.d, 1 + i)

    iseven(k) && throw("Argument k=$k should be odd.")
    n = mod(n, k)
    t = 1
    last_limb = get_limb(k, 0)
    k_2 = (last_limb & 2 == 2)  # second bit
    k_3 = (last_limb & 4 == 4)  # third bit
    
    while !iszero(n)
        z = trailing_zeros(n)
        n >>= z
        # After the right shift, check bits directly on the low limb
        last_limb = get_limb(n, 0)
        n_2 = (last_limb & 2 == 2)  # second bit
        
        # For k ≡ 3,5 (mod 8) when z is odd
        if isodd(z) && (k_2 ⊻ k_3)
            t = -t
        end
        
        # For quadratic reciprocity when both are ≡ 3 (mod 4)
        if k_2 && n_2
            t = -t
        end
        
        # Save current n's bits for next iteration's k
        k_2 = n_2
        k_3 = (last_limb & 4 == 4)
        n, k = k, n
        n = mod(n, k)
    end
    return isone(k) ? t : 0
end

which is also around 4x faster.
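
As a quick sanity check against the failures quoted above (expected values only, not captured test output):

# The two cases that failed with jacobi2 now give the expected symbols.
@assert jacobi(big(2), big(5)) == -1
@assert jacobi(big(7), big(9)) == 1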


If you replace get_limb(n, 0) with n%Int, this version will also work for other Integer types.
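
Something like this sketch, where jacobi3 is just an illustrative name and only the bit access changes compared to the BigInt version above:

# Same algorithm as the BigInt version, but reading the low bits with `x % Int`
# (a truncating conversion), which works for BigInt, BitIntegers types, and
# plain machine integers alike.
function jacobi3(n::Integer, k::Integer)
    iseven(k) && throw("Argument k=$k should be odd.")
    n = mod(n, k)
    t = 1
    low = k % Int
    k_2 = (low & 2 == 2)
    k_3 = (low & 4 == 4)
    while !iszero(n)
        z = trailing_zeros(n)
        n >>= z
        low = n % Int
        n_2 = (low & 2 == 2)
        if isodd(z) && (k_2 ⊻ k_3)   # k ≡ 3, 5 (mod 8)
            t = -t
        end
        if k_2 && n_2                # both ≡ 3 (mod 4): quadratic reciprocity
            t = -t
        end
        k_2 = n_2
        k_3 = (low & 4 == 4)
        n, k = k, n
        n = mod(n, k)
    end
    return isone(k) ? t : 0
end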
