Sieving for B-smooth values

I am playing around with a quadratic sieve for factorization of large numbers. Most of the work in this algorithm is searching for B-smooth numbers. If you know the Sieve of Eratosthenes, you can use the same process to sieve for B-smooth numbers by adding the base-2 logs of your factor base primes over the numbers you are searching through. I learned of this from watching a series by Ilya Gazman.
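
To illustrate the idea (a minimal sketch with made-up parameters; fb, logs, sta, and cabin play the same roles as in my real code further down):

using Primes

# Minimal sketch of log-sieving: for each prime p in the factor base,
# add round(log2(p)) at every index j with p | (sta + j). An index whose
# accumulated logs come close to log2(sta + j) is probably B-smooth and
# worth trial-dividing to confirm.
fb    = primes(1000)                          # toy factor base
logs  = Int16[round(Int16, log2(p)) for p in fb]
sta   = 10_000                                # interval covers sta+1 ... sta+len
len   = 4096
cabin = zeros(Int16, len)
for (p, lg) in zip(fb, logs)
    for j in (p - sta % p):p:len              # first index divisible by p, then step p
        cabin[j] += lg
    end
end
candidates = [sta + i for i in eachindex(cabin) if cabin[i] >= 12]   # threshold is tunable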

This is giving me a 30x speedup over taking the gcd with the product of every 100 primes in the factor base.
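
For reference, the gcd approach I was comparing against looked roughly like this (a sketch, not my exact code):

using Primes

# Rough sketch of the gcd-based smoothness test: take the gcd of the
# candidate with the product of each batch of 100 factor-base primes,
# and divide out the shared part until the candidate collapses to 1
# (smooth) or no batch divides it any further (not smooth).
function is_smooth_gcd(n::BigInt, fb::Vector{Int})
    m = n
    for chunk in Iterators.partition(fb, 100)
        p = prod(BigInt, chunk)
        g = gcd(m, p)
        while g > 1
            m ÷= g
            g = gcd(m, p)
        end
        m == 1 && return true
    end
    return false
end

# is_smooth_gcd(big(2)^20 * big(3)^5, primes(1000))  # -> true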

Section              ncalls    time   %tot     avg    alloc   %tot     avg
Smoothness test 2       111   79.8s  98.8%   719ms  17.9GiB  99.4%  165MiB

Searching 27 million numbers takes 13 seconds on Codespaces with a 2-core virtual machine. With the gcd method it would take 6 minutes…

I don't know if anyone has speedups for the sieving part or for the identifying-the-smooth-numbers part.

#sieve using the factor base, ignoring the first 100 primes
#because those are expensive to mark
for i in 101:6600
    stu = fb[i] - (sta % fb[i])
    for g in stu:fb[i]:stump
        cabin[g] += logs[i]
    end
end

#identify B-smooth numbers
lmax = bitlength(num) - 16
for i in eachindex(cabin)
    if cabin[i] > lmax
        push!(rn, i + sta)
    end
end

The second part is taking a crazy amount of time… maybe I should use filter() and assign the type ::Int16 to lmax…? If I use filter I need to get back the i values, not the values of cabin… the number of B-smooth values I am looking for is only 1 to 8 per search, so it's not because I am pushing the values onto rn…
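
For reference, findall would return the indices rather than the values (a sketch using the same variable names as above):

# findall returns the indices i where the predicate holds, so there is no
# need to recover i from the values of cabin afterwards
hits = findall(>(lmax), cabin)
append!(rn, hits .+ sta)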

Any chance you could contribute a quadratic sieve to Primes.jl? Currently we're just using Pollard rho, and it would be good to have a better algorithm for big numbers. I started adding an ECM implementation here (add lenstra eliptic curve factoring by oscardssmith · Pull Request #104 · JuliaMath/Primes.jl) but never finished cleaning it up enough to merge.

I am using Primes.jl to generate the factor base. It seems really efficient. I noticed that the factorization algorithm in Primes.jl takes 100MB of memory every time it's called for numbers around 20 digits, which is maybe because of the number of iterations of Pollard rho. I am just a hobbyist. Right now I am not doing a proper quadratic sieve, but if I go down that road then I will message you.

Glad it’s working well for you!

I'm not able to reproduce the allocation amount at all (or did you mean factoring a number with two 20-digit factors?)

julia> x = nextprime(rand(1:big(10)^11))*nextprime(rand(1:big(10)^9))
2748861241862871151

julia> @btime factor(x)
  7.183 ms (312156 allocations: 5.14 MiB)
32529433 * 84503816647

The allocations that are there are from allocating BigInt arithmetic. Looking at the GC time, this code could probably be ~30% faster if we switched to in-place BigInt math (or if the compiler got a lot smarter about moving the BigInt finalizers earlier).
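
For example, in-place BigInt math is possible through the unexported (and therefore unstable) Base.GMP.MPZ wrappers, roughly like this:

# Sketch of in-place BigInt arithmetic via the internal Base.GMP.MPZ
# wrappers; each call reuses the first argument's storage instead of
# allocating a fresh BigInt. Internal API: it may change between Julia
# versions.
x   = big(2)^100
y   = big(3)^50
acc = big(0)
Base.GMP.MPZ.mul!(acc, x, y)    # acc = x * y, in place
Base.GMP.MPZ.add!(acc, acc, x)  # acc = acc + x, in place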

I can't reproduce it either; maybe it's because I was using TimerOutputs.jl and was running it alongside my other code just to see how fast it is. Total memory use was over 15 GB, about 100MB per call.

So it seems that large numbers were contaminating my results. For 20-digit numbers the memory use is very reasonable. Most of my numbers were 20 digits, but I guess I had some larger ones mixed in there.


To better answer the question you were actually asking, rather than going off on a tangent: it would be necessary to get an MWE of what you're trying to do. Having runnable code makes figuring out the performance issues a lot easier.


using Primes, TimerOutputs, Random
""" 
Calculate log base 2 floored 
"""
function bitlength(x::BigInt)::Int16
    return length(digits(x;base=2))-1
end

function isSmooth2(num::BigInt,rang::Int)
	@timeit to "Smoothness test 2" begin 
	rn = BigInt[]
	bump = 2^(rang-1)
	stump = 2^rang+one
	sta = num - bump - one
	@timeit to "create vector" begin cabin = zeros(Int16,stump) end
	#sieve using factorbase ignoring the first 100 
	#because those are expensive to mark
	println(typeof(logs))
	@timeit to "sieving" begin
	for i in 101:6600
		stu=fb[i]-(sta%fb[i])
		cabin[stu:fb[i]:stump] .+=logs[i]
	end
	#sieve using the extended factorbase
	#for i in np
	#	stu=np[i]-(sta%np[i])
	#	cabin[stu:np[i]:stump] .+=logs2[i]
	#end
	end #timeit
	@timeit to "identify the bsmooth values" begin
	#find bsmooth numbers
	lmax = bitlength(num)-16
	for i in eachindex(cabin)
		if cabin[i]>lmax
			push!(rn,i+sta)
		end
	end
	end #timeit
	#expand search for Bsmooth values
	cs=0
	while length(rn)==0
		if cs<0
			cs-=1
		else
			cs+=1
		end
		sta+=stump*cs
		@timeit to "create vector" begin cabin = zeros(Int16,stump) end
        @timeit to "sieving" begin
		for i in 101:6600
			stu=fb[i]-(sta%fb[i])
			cabin[stu:fb[i]:stump] .+=logs[i]
		end
		#for i in np
		#	stu=np[i]-(sta%np[i])
		#	cabin[stu:np[i]:stump] .+=logs2[i]
		#end
        end #timeit
		@timeit to "identify the bsmooth values" begin
		for i in eachindex(cabin)
			if cabin[i]>lmax
				push!(rn,i+sta)
			end
		end
		end #timeit
		#now search in the block of numbers 
		#in the opposite direction
        print(" ($cs) ")
		cs*=-1
	end
	println("The length of rn is $(length(rn)) and we did $(abs(cs)+1) searches!")
	return(rn)
	end #time it
end

begin
    const to = TimerOutput()
    global fb = []
    fb=primes(66103)
    global logs = zeros(Int16,length(fb))
    global const one = BigInt(1)
    local guf=0
    local sqrtof2::Float64= 1.4142135623730951
    #calculate log base 2 this time rounded....
    for i in 1:6600
        if fb[i] > sqrtof2
                guf += 1
                sqrtof2 *= 2.0
        end
        logs[i]=guf
    end
    for i in 1:10
        x::BigInt= rand(BigInt(2)<<90:BigInt(2)<<100)
        println(x)
        display(isSmooth2(x,20))
    end
    show(to)
end

Seems like now sieving is the only expensive part… the problem seems to have been not defining the type of lmax. Now I need to know if it's possible to speed up the sieving loop. Would threading help?

The performance problem here is just that logs is a global variable (as are all the other global variables).


So should I paste the function directly into my code to avoid having global arrays?

My refactor of this would be

function bitlength(x::BigInt)::Int16
    return length(digits(x;base=2))-1
end

const ONE = BigInt(1)

function isSmooth2(num::BigInt, rang::Int, small_primes, logs)
	rn = BigInt[]
	bump = 2^(rang-1)
	stump = 2^rang+ONE
	sta = num - bump - ONE
	cabin = zeros(Int16,stump)
	#sieve using factorbase ignoring the first 100 
	#because those are expensive to mark
	for i in 101:6600
		stu=small_primes[i] - (sta%small_primes[i])
		cabin[stu:small_primes[i]:stump] .+= logs[i]
	end
	#sieve using the extended factorbase
	#for i in np
	#	stu=np[i]-(sta%np[i])
	#	cabin[stu:np[i]:stump] .+=logs2[i]
	#end
	#find bsmooth numbers
	lmax = bitlength(num)-16
	for i in eachindex(cabin)
		if cabin[i]>lmax
			push!(rn,i+sta)
		end
	end
	#expand search for Bsmooth values
	cs=0
	while isempty(rn)
		if cs<0
			cs-=1
		else
			cs+=1
		end
		sta+=stump*cs
		cabin = zeros(Int16,stump)
		for i in 101:6600
			stu=small_primes[i]-(sta%small_primes[i])
			cabin[stu:small_primes[i]:stump] .+= logs[i]
		end
		#for i in np
		#	stu=np[i]-(sta%np[i])
		#	cabin[stu:np[i]:stump] .+=logs2[i]
		#end
		for i in eachindex(cabin)
			if cabin[i]>lmax
				push!(rn,i+sta)
			end
		end
		#now search in the block of numbers 
		#in the opposite direction
		cs*=-1
	end
	return rn
end

function make_logs(small_primes)
    logs = zeros(Int16,length(small_primes))
    guf = 0
    sqrtof2 = sqrt(2)
    #calculate log base 2 this time rounded....
    for i in 1:6600
        if small_primes[i] > sqrtof2
            guf += 1
            sqrtof2 *= 2.0
        end
        logs[i] = guf
    end
    logs
end

small_primes = primes(66103)
logs = make_logs(small_primes)

for i in 1:10
    x::BigInt= rand(BigInt(2)<<90:BigInt(2)<<100)
    println(x)
    display(isSmooth2(x, 20, small_primes, logs))
end

which takes 0.4 seconds per isSmooth2 call. I've also removed TimerOutputs, since IMO it's usually better to profile code (e.g. using @profview from ProfileView) rather than inserting points to time manually.
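
Basic usage is just (a sketch, assuming x, small_primes, and logs from the script above are defined):

using ProfileView

isSmooth2(x, 20, small_primes, logs)    # run once first so compilation isn't profiled
@profview isSmooth2(x, 20, small_primes, logs)   # opens a flame-graph window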


Wow, thanks a lot for taking the time to refactor my code… this helps me understand how to do things in a more correct manner.

Since small_primes doesn't change and gets accessed a lot, maybe I can make it a static array… that's something I have thought about; I don't know if you have thoughts on this idea. Also, I could pass it into the function with ! to try to get the compiler to reuse the same cache.

After testing: the "!" did not help for passing in the small_primes array… I also tried declaring small_primes and logs const and it did nothing… StaticArrays is only good for small arrays, so that's a no-go… Your refactored code ended up giving me a 20 percent speedup, which is significant because this is the workhorse of my program… I am guessing the speed gain came from not having global variables.

I don't think making it a static array would help anything.


I tried using ProfileView but I am finding it impossible to understand, and I get a bunch of errors, one of which is a critical one:

(julia.exe:4816): Gdk-CRITICAL **: 11:21:52.769: load_layout_dll: assertion ‘dll != NULL’ failed

I am using VS Code locally. My machine is Windows 8.1. It seems others have also had problems trying to install it… I get the flame graph, but as soon as my code stops it disappears… haha… most of the code it refers to is not my code but the Julia libraries, which I don't know what to do with… most of the colors are green, light and dark, plus purple. Also, if I try to save the flame graph, as soon as I press a button the instance of Julia crashes with a bunch of Gdk errors… which I think has to do with Gtk…

Maybe in the future I will be able to decipher this flame graph stuff, but for now I think it's just too advanced for me. TimerOutputs is just the easiest to use… I even find BenchmarkTools difficult to use.


I have now got at least a 50x speedup. It turns out that stump was being typed as a BigInt by the compiler; getting rid of that gave the majority of the speedup. Then ChatGPT told me to use the @views macro when taking part of an array… I think in the future I will probably just pass in a slice of the array. The other thing is that the initialization of the arrays can be improved using malloc or something like that… I was investigating that last night… they were saying that zeros() is a legacy way of allocating memory. On your system I believe the code will run in under 10ms. Now my program should run scary fast, and the possibility of having a working program to contribute in the future is greatly increased. The total increase over gcd division is at least 1500x… which is insane.
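
Concretely, the two changes look something like this (a sketch against the refactored isSmooth2 above, not the full function):

# 1) keep stump (and hence all the sieve indexing) in Int:
stump = 2^rang + 1                  # previously 2^rang + ONE, which promoted stump to BigInt

# 2) use @views so the slice on the right-hand side of .+= is a view, not a copy:
for i in 101:6600
    stu = small_primes[i] - Int(sta % small_primes[i])
    @views cabin[stu:small_primes[i]:stump] .+= logs[i]
end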

The other thing is that the initialization of the arrays can be improved using malloc or something like that… I was investigating that last night… they were saying that zeros() is a legacy way of allocating memory.

What are you talking about? zeros here should be completely fine.


10 min video about memory allocations

Anyway, when the search space is in the millions I will be initializing using a wheel, so this may not be so important.
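
The wheel idea, roughly (a sketch; the period and the primes used are illustrative):

# Precompute the log contributions of the smallest primes over one period
# (2*3*5*7 = 210), then tile that pattern to initialize each block instead
# of starting from zeros and re-marking the tiny primes every time.
wheel_len = 2 * 3 * 5 * 7
wheel = zeros(Int16, wheel_len)
for (p, lg) in ((2, 1), (3, 2), (5, 2), (7, 3))   # rounded log2 of each prime
    for j in p:p:wheel_len
        wheel[j] += lg
    end
end

# the block covers numbers sta+1 ... sta+stump; phase-align the wheel with sta
r = Int(sta % wheel_len)
cabin = Int16[wheel[mod1(r + i, wheel_len)] for i in 1:stump]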

After testing: you are correct, using zeros() is fast. For my code, lazy initialization of the array actually caused a 2x slowdown.