High-Performance, Lane-Based Segmented Sieve in Julia

Hi everyone
I wanted to share a program I’ve been working on within my GC60 project, which focuses on high-magnitude prime searching (beyond the 64-bit barrier). The goal wasn’t just to write a fast sieve, but to explore how a specialized mathematical structure, a Modulo-30 residue system organized into 8 computational lanes, interacts with Julia’s high-level capabilities.

The Challenge:
How can we implement a non-linear, lane-based sieve in a high-level language without being crushed by the overhead of memory management and thread synchronization?

The Implementation:
To achieve near-native performance, I focused on three specific Julia optimizations:

Memory Density: Using BitVector to align segment sizes with CPU L1/L2 cache.
GC Avoidance: Implementing all “Prep” and “Mapping” phases as in-place operations to eliminate mid-computation allocations. Load Balancing: Using Atomic{Int} for a dynamic work-stealing pattern to keep all threads busy despite the irregular distribution of primes across the 8 lanes.

The Results:
In a direct comparison of the core logic:

Primesieve (Kim Walisch):        ~1.911s
M30x8 (Julia Specialized Logic): ~1.046s

Note: While a direct C++ implementation of the M30x8 architecture is faster (~0.6s) due to hardware-level SIMD/AVX2 optimizations, this Julia implementation still significantly outperforms the industry-standard Primesieve (1.911s). This highlights that strategic algorithmic design can often outweigh raw language-level speed in high-performance computing tasks.

Verification:
All results were rigorously validated using SymPy in Python to ensure 100% mathematical correctness.

The full implementation is available on GitHub. The README includes benchmark specifications and timing results up to 10^24, feel free to download it, run it on your own machine, and see for yourself.

Repo: GitHub - Claugo/GC60_M30x8_suffix: High-Performance, Lane-Based Segmented Sieve in Julia · GitHub

That’s a really smart approach! I was just working on getting a segmented sieve for Primes.jl (Segmented sieve for `primes`, add `eachprime` iterator. - Pull Request #120 - JuliaMath/Primes.jl - GitHub), but this approach seems much nicer. Would you consider adding it as a PR to Primes (or any objections if I do)?

One other thing worth noting is that your ranges for the benchmarks are pretty small. For such a small range, rather than sieving you’re a lot better off by doing a basic primality test per number.

julia> using Primes
julia> function num_primes(range)
           num = 0
           for i in range
               isprime(i) && (num += 1)
           end
           num
       end
num_primes (generic function with 1 method)

julia> @time num_primes(Int128(10)^19:Int128(10)^19+10^5)
  0.022324 seconds (308.50 k allocations: 12.371 MiB) # vs 1s for the sieve
2263

well done!

out of curiosity what is this? (I can’t find it online)

Of course you can do that; I’d really appreciate it

This GC-60 project was published on Zenodo records 18657440. From there, algorithms were developed capable of reaching 10^27 in just over 3 hours, such as GC60_M30x3_suffix, which precedes X8. This GC60_M30x8_suffix has a slightly different concept and spans 8 columns, which is why I chose to implement it in Julia, as it seemed well-suited to managing the 8 threads very efficiently.