After wrapping the amazingly fast OpenSSL group operations (see OpenSSLGroups.jl), I found a new bottleneck in proof-of-shuffle verification in ShuffleProofs.jl. Half of the time is now spent computing a Jacobi symbol in CryptoUtils.jl:
function jacobi(n::Integer, k::Integer)
    rem(k, 2) != zero(k) || throw("Argument k=$k should be odd.")
    n = mod(n, k)
    t = 1
    while n != zero(n)
        while rem(n, 2) == 0
            n = div(n, 2)
            r = k % 8
            if r == 3 || r == 5
                t = -t
            end
        end
        n, k = k, n
        if n % 4 == 3 && k % 4 == 3
            t = -t
        end
        n = mod(n, k)
    end
    if k == oneunit(k)
        return t
    else
        return 0
    end
end
@code_warntype did not show any instabilities. Is there anything that could make this function faster?
Specifically, this seems to be about 10% faster for Int:
function jacobi2(n::Integer, k::Integer)
    iseven(k) && throw("Argument k=$k should be odd.")
    n = mod(n, k)
    t = 1
    k_1 = k & 2 == 2                  # second bit of k, i.e. k ≡ 3 (mod 4)
    while !iszero(n)
        z = trailing_zeros(n)
        n >>= z
        n_1 = n & 2 == 2              # second bit of n, i.e. n ≡ 3 (mod 4)
        # flip for the factors of two when k ≡ 3, 5 (mod 8),
        # and for quadratic reciprocity when n ≡ k ≡ 3 (mod 4)
        if (isodd(z) && (k_1 ⊻ (k & 4 == 4))) ⊻ (k_1 && n_1)
            t = -t
        end
        k_1 = n_1
        n, k = k, n
        n = mod(n, k)
    end
    return isone(k) ? t : 0
end
This looks good. I also found that I can get around a 10x speedup with BitIntegers.jl using the original jacobi method. For some reason, the jacobi2 method with Int256 is slower than the original jacobi method.
The BitIntegers were tricky to exploit in practice, as they turned out to be slower for the sqrt_mod_prime function, which I also needed. My solution was to call is_quadratic_residue(U(y), U(p)), where the type U is selected at runtime, while keeping the rest of the code in BigInt as before. I also gained a significant boost by removing redundant quadratic residue checks in sqrt_mod_prime. With these changes, the Jacobi symbol calculation takes only 1/20th of the verification time. A significant success!
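The runtime type-selection trick can be sketched as follows. This is only an illustration, not the actual ShuffleProofs.jl code: width_type is a hypothetical helper, the residue check uses the Euler criterion, and Base types stand in for the fixed-width ones; in practice U would be a BitIntegers.jl type such as UInt256.

```julia
# Euler criterion: y is a quadratic residue mod an odd prime p
# iff y^((p-1)/2) ≡ 1 (mod p).
is_quadratic_residue(y::T, p::T) where {T<:Integer} =
    powermod(y, (p - 1) ÷ 2, p) == 1

# Pick the narrowest unsigned type that fits p at runtime
# (hypothetical helper; with BitIntegers.jl the branches would
# return UInt256, UInt512, ... instead).
function width_type(p::BigInt)
    nbits = ndigits(p, base = 2)
    nbits <= 63 && return UInt64      # leave headroom
    nbits <= 127 && return UInt128
    return BigInt                     # fall back for larger moduli
end

p = big(23)                # safe prime: p = 2q + 1 with q = 11
y = big(2)
U = width_type(p)          # UInt64 here
is_quadratic_residue(U(y), U(p))   # → true, since 5² ≡ 2 (mod 23)
```

The rest of the pipeline stays in BigInt; only the hot call converts its operands to the narrow type.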
Nice! What is the sqrt_mod_prime function? My guess is that BitIntegers.jl is probably missing some specializations that would make it way faster in general (possibly their iseven or something is bad?)
I came across another instance where the Jacobi symbol was a bottleneck, but now with numbers in the 2048-bit range. (It is used for checking membership in quadratic residue prime groups where p = 2*q + 1, as a faster alternative to testing powermod(x, q, p) == 1.) In this range, the BitInteger computation was significantly slower than with BigInt.
Profiling the code revealed that it all comes down to mod performance, which becomes significantly slower for numbers exceeding the 512-bit range:
using BenchmarkTools
using BitIntegers

BitIntegers.@define_integers 2048 Int2048 UInt2048
BitIntegers.@define_integers 4096 Int4096 UInt4096

for U in [UInt, UInt128, UInt256, UInt512, UInt1024, UInt2048, UInt4096]
    r = rand(U)
    p = rand(U)
    println("$(nameof(U)):")
    @btime mod($r, $p)
    @btime mod($(BigInt(r)), $(BigInt(p)))
    println("")
end
I noticed that the performance step between UInt256 and UInt512 is really small. It would be exceptional to have UInt1024 evaluate at 40 ns and UInt2048 at 50 ns, continuing the trend, but I did not know how to make an AI fix that for me. I did manage to get Claude to spit out the following code for BigInt:
function jacobi(n::BigInt, k::BigInt)
    # Get the i-th limb (word) of a BigInt, 0-based from the least-significant end
    get_limb(n::BigInt, i::Int) = unsafe_load(n.d, 1 + i)

    iseven(k) && throw("Argument k=$k should be odd.")
    n = mod(n, k)
    t = 1
    last_limb = get_limb(k, 0)
    k_2 = (last_limb & 2 == 2) # second bit
    k_3 = (last_limb & 4 == 4) # third bit
    while !iszero(n)
        z = trailing_zeros(n)
        n >>= z
        # After the right shift, check bits directly on the lowest limb
        last_limb = get_limb(n, 0)
        n_2 = (last_limb & 2 == 2) # second bit
        # For k ≡ 3, 5 (mod 8) when z is odd
        if isodd(z) && (k_2 ⊻ k_3)
            t = -t
        end
        # For quadratic reciprocity when both are ≡ 3 (mod 4)
        if k_2 && n_2
            t = -t
        end
        # Save current n's bits for the next iteration's k
        k_2 = n_2
        k_3 = (last_limb & 4 == 4)
        n, k = k, n
        n = mod(n, k)
    end
    return isone(k) ? t : 0
end