I am playing around with a quadratic sieve for factorization of large numbers. Most of the work of this algorithm is searching for Bsmooth numbers. If you know the sieve of eratosthenes you can use this process to sieve for Bsmooth numbers by adding the logs base 2 of your factor base over the numbers you are searching through. I learned of this from watching this sieres by Ilya Gazman

This is giving me a 30x speed up over using gcd over the product of every 100 primes in the factor base.

Section ____________ ncalls time %tot __ avg __ alloc __ %tot ____ avg

Smoothness test 2 _____ 111 79.8s 98.8% 719ms 17.9GiB 99.4% 165MiB

Searching 27 million numbers is taking 13 seconds using codespaces with a 2 core virtual machine. With the gcd method it would take 6 minutes…

I dont know if anyone has speed ups for the sieving part or the indentifying the smooth numbers part

```
#sieve using factorbase ignoring the first 100
#because those are expensive to mark
for i in 101:6600
stu=fb[i]-(sta%fb[i])
for g in stu:fb[i]:stump
cabin[g]+=logs[i]
end
end
#identify bsmooth numbers
lmax = bitlength(num)-16
for i in eachindex(cabin)
if cabin[i]>lmax
push!(rn,i+sta)
end
end
```

The second part is taking a crazy amount of time…maybe I should use filter() and assign the type ::Int16 to lmax…? If I use filter I need to get back the i values not the values of cabin…the amount of Bsmooth values I am looking for is only 1 to 8 per search so its not because I am pushing the values on rn…

Any chance you could contribute a quadratic sieve to Primes.jl? Currently we’re just using Pollard Rho, and it wold be good to have a better algorithm for big numbers. I started adding an ECM implementation here add lenstra eliptic curve factoring by oscardssmith · Pull Request #104 · JuliaMath/Primes.jl · GitHub but never finished cleaning it up enough to merge.

I am using Primes.jl to generarte the factor base. It seems really effecient. I noticed that the factorization algorithm for Primes.jl takes 100MB of memory everytime it’s called for numbers around 20 digits which is maybe because of the amount of iterations of phollard rho. I am just a hobbyist. Right now I am not doing a proper quadratic sieve but if I go down that road then I will message you.

Glad it’s working well for you!

I’m not able to reproduce the allocations ammount at all, (or did you mean factoring a number with 2 20 digit factors?)

```
julia> x = nextprime(rand(1:big(10)^11))*nextprime(rand(1:big(10)^9))
2748861241862871151
julia> @btime factor(x)
7.183 ms (312156 allocations: 5.14 MiB)
32529433 * 84503816647
```

The allocations that are there are from allocating `BigInt`

arithmetic. Looking at the GC time, this code could probably be ~30% faster if we switched to in place BigInt math (or if the compiler got a lot smarter about moving the `BigInt`

finalizes earlier).

I can’t reproduce it either maybe its because I was using timeroutputs.jl and I was running with my other code with it just to see how fast it is. Total memory use was over 15 gbs and 100mb per call.

So it seems that large numbers were contaminating my results. For 20 digit numbers the memory use is very reasonable. Most of my numbers were 20 digit size but I guess I had some large ones mixed in there.

1 Like

To better answer the question you were actually asking rather going off on a tangent, it would be necessary to get a MWE of what you’re trying to do. Having run-able code makes figuring out the performance issues a lot easier.

```
using Primes, TimerOutputs, Random
"""
Calculate log base 2 floored
"""
function bitlength(x::BigInt)::Int16
return length(digits(x;base=2))-1
end
function isSmooth2(num::BigInt,rang::Int)
@timeit to "Smoothness test 2" begin
rn = BigInt[]
bump = 2^(rang-1)
stump = 2^rang+one
sta = num - bump - one
@timeit to "create vector" begin cabin = zeros(Int16,stump) end
#sieve using factorbase ignoring the first 100
#because those are expensive to mark
println(typeof(logs))
@timeit to "sieving" begin
for i in 101:6600
stu=fb[i]-(sta%fb[i])
cabin[stu:fb[i]:stump] .+=logs[i]
end
#sieve using the extended factorbase
#for i in np
# stu=np[i]-(sta%np[i])
# cabin[stu:np[i]:stump] .+=logs2[i]
#end
end #timeit
@timeit to "identify the bsmooth values" begin
#find bsmooth numbers
lmax = bitlength(num)-16
for i in eachindex(cabin)
if cabin[i]>lmax
push!(rn,i+sta)
end
end
end #timeit
#expand search for Bsmooth values
cs=0
while length(rn)==0
if cs<0
cs-=1
else
cs+=1
end
sta+=stump*cs
@timeit to "create vector" begin cabin = zeros(Int16,stump) end
@timeit to "sieving" begin
for i in 101:6600
stu=fb[i]-(sta%fb[i])
cabin[stu:fb[i]:stump] .+=logs[i]
end
#for i in np
# stu=np[i]-(sta%np[i])
# cabin[stu:np[i]:stump] .+=logs2[i]
#end
end #timeit
@timeit to "identify the bsmooth values" begin
for i in eachindex(cabin)
if cabin[i]>lmax
push!(rn,i+sta)
end
end
end #timeit
#now search in the the block of numbers
#in the opposite direction
print(" ($cs) ")
cs*=-1
end
println("The length of rn is $(length(rn)) and we did $(abs(cs)+1) searches!")
return(rn)
end #time it
end
begin
const to = TimerOutput()
global fb = []
fb=primes(66103)
global logs = zeros(Int16,length(fb))
global const one = BigInt(1)
local guf=0
local sqrtof2::Float64= 1.4142135623730951
#calculate log base 2 this time rounded....
for i in 1:6600
if fb[i] > sqrtof2
guf += 1
sqrtof2 *= 2.0
end
logs[i]=guf
end
for i in 1:10
x::BigInt= rand(BigInt(2)<<90:BigInt(2)<<100)
println(x)
display(isSmooth2(x,20))
end
show(to)
end
```

Seems like now sieving is the only expensive part…the problem seems to be not defining the type of lmax. Now I need to know if its possible to speedup the sieving loop. Would threading help?

The performance problem here is just that `logs`

is a global variable (and all the other global variables)

1 Like

so should paste the function directly into my code to avoid having global arrays?

My refactor of this would be

```
function bitlength(x::BigInt)::Int16
return length(digits(x;base=2))-1
end
const ONE = BigInt(1)
function isSmooth2(num::BigInt, rang::Int, small_primes, logs)
rn = BigInt[]
bump = 2^(rang-1)
stump = 2^rang+ONE
sta = num - bump - ONE
cabin = zeros(Int16,stump)
#sieve using factorbase ignoring the first 100
#because those are expensive to mark
for i in 101:6600
stu=small_primes[i] - (sta%small_primes[i])
cabin[stu:small_primes[i]:stump] .+= logs[i]
end
#sieve using the extended factorbase
#for i in np
# stu=np[i]-(sta%np[i])
# cabin[stu:np[i]:stump] .+=logs2[i]
#end
#find bsmooth numbers
lmax = bitlength(num)-16
for i in eachindex(cabin)
if cabin[i]>lmax
push!(rn,i+sta)
end
end
#expand search for Bsmooth values
cs=0
while isempty(rn)
if cs<0
cs-=1
else
cs+=1
end
sta+=stump*cs
cabin = zeros(Int16,stump)
for i in 101:6600
stu=small_primes[i]-(sta%small_primes[i])
cabin[stu:small_primes[i]:stump] .+= logs[i]
end
#for i in np
# stu=np[i]-(sta%np[i])
# cabin[stu:np[i]:stump] .+=logs2[i]
#end
for i in eachindex(cabin)
if cabin[i]>lmax
push!(rn,i+sta)
end
end
#now search in the the block of numbers
#in the opposite direction
cs*=-1
end
return rn
end
function make_logs(small_primes)
logs = zeros(Int16,length(small_primes))
guf = 0
sqrtof2 = sqrt(2)
#calculate log base 2 this time rounded....
for i in 1:6600
if small_primes[i] > sqrtof2
guf += 1
sqrtof2 *= 2.0
end
logs[i] = guf
end
logs
end
small_primes = primes(66103)
logs = make_logs(small_primes)
for i in 1:10
x::BigInt= rand(BigInt(2)<<90:BigInt(2)<<100)
println(x)
display(isSmooth2(x, 20, small_primes, logs))
end
```

which takes 0.4 seconds per `issmooth`

call. I’ve also removed TimerOutputs. since IMO it’s usually better to profile code (e.g. using `@profview`

from `ProfileView`

) rather than inserting points to time manually.

4 Likes

Wow thanks a lot for taking the times to refactor my code…this helps me understand how to do thing in a more correct manner.

Since small_primes doesnt change and it gets accessed a lot maybe I can make it a static array…thats something I have thought about I don’t know if you have thoughts on this idea. Also I could pass it in the function with ! to try to get the compiler to reuse the same cache.

**after testing:** The “!” did not help for passing in the small_primes array…also tried declaring small_primes and logs const and did nothing…Static Arrays is only good for small arrays so thats a no go…Your refactored code ended up giving me a 20 percent speed up which is significant cause this is the workhorse of my program…I am guessing the speed gain came from not having global variables.

I don’t think making it static array would help anything.

1 Like

I tried using ProfileView but I am finding it impossible to understand and I get a bunch of errors one of which is a critical one

(julia.exe:4816): Gdk-CRITICAL **: 11:21:52.769: load_layout_dll: assertion ‘dll != NULL’ failed

I am using vscode locally. My machine is Windows 8.1. Seems others also have had problems trying to install it…I get the flame graph but as soon as my code stops it disappears…haha…most of the code it refers to is not my code but the julia libraries which I dont know what to do with…most of the colors are green light and dark …purple. Also if I try to save the flamegraph…as soon as I press a button the instance of julia crashes with a bunch of Gdk errors…which I think has to do with Gtk …

maybe in the future I will be able to decipher this flamegraph stuff but for now I think its just too advanced for me. TimerOutputs is just the easiest to use…I even find Benchmarktools difficult to use.

1 Like

I have now got at least a 50x speed up. Turns out that stump was being typed as a BigInt by the compiler getting rid of that gave the majority of the speed up. Then ChatGPT told me to use the views macro when taking part of an array…I think in the future I will probably just pass in a slice of the array. The other thing is the initializing of the arrays can be improved using malloc or something like that…I was investigating that last night…they were saying that zeros() is a legacy way of allocating memory. On your system I believe the code will run in under 10ms. Now my program should run scary fast and the possibility of having a working program to contribute in the future is greatly increased. Total increase from GCD division is at least 1500x…which is insane.

The other thing is the initializing of the arrays can be improved using malloc or something like that…I was investigating that last night…they were saying that zeros() is a legacy way of allocating memory.

What are you talking about? `zeros`

here should be completely fine.

1 Like

10 min video about memory allocations

Anyways when the search space is in the millions I will be initializing using a wheel so this may not be so important

**After testing**: You are correct using zeros() is fast. For my code lazy initialization of the array actally had a 2x slow down on my code.