Hi everyone
I wanted to share a program I’ve been working on within my GC60 project, which focuses on high-magnitude prime searching (beyond the 64-bit barrier). The goal wasn’t just to write a fast sieve, but to explore how a specialized mathematical structure, a Modulo-30 residue system organized into 8 computational lanes, interacts with Julia’s high-level capabilities.
The Challenge:
How can we implement a non-linear, lane-based sieve in a high-level language without being crushed by the overhead of memory management and thread synchronization?
The Implementation:
To achieve near-native performance, I focused on three specific Julia optimizations:
Memory Density: Using BitVector to align segment sizes with CPU L1/L2 cache.
GC Avoidance: Implementing all “Prep” and “Mapping” phases as in-place operations to eliminate mid-computation allocations. Load Balancing: Using Atomic{Int} for a dynamic work-stealing pattern to keep all threads busy despite the irregular distribution of primes across the 8 lanes.
The Results:
In a direct comparison of the core logic:
Primesieve (Kim Walisch): ~1.911s
M30x8 (Julia Specialized Logic): ~1.046s
Note: While a direct C++ implementation of the M30x8 architecture is faster (~0.6s) due to hardware-level SIMD/AVX2 optimizations, this Julia implementation still significantly outperforms the industry-standard Primesieve (1.911s). This highlights that strategic algorithmic design can often outweigh raw language-level speed in high-performance computing tasks.
Verification:
All results were rigorously validated using SymPy in Python to ensure 100% mathematical correctness.
The full implementation is available on GitHub. The README includes benchmark specifications and timing results up to 10^24, feel free to download it, run it on your own machine, and see for yourself.
Repo: GitHub - Claugo/GC60_M30x8_suffix: High-Performance, Lane-Based Segmented Sieve in Julia · GitHub