Can Julia do this algorithm faster?

As I said at the start, I don’t have the time to learn (well) another whole language. My life currently doesn’t afford me that luxury. If no one is curious enough to want to do it, that’s cool. I thought I’d try though.

It really shouldn’t take that long to do since you have working code to mimic, if Julia is really that easy to code in.

1 Like

If someone feels like doing this they will. If no one feels like it they won’t. Simple enough.

8 Likes

Yes, will see.

I’ll respectfully point out that your advice about using the existing code also has the implicit requirement of knowing at least one of the other languages used so far. As you’ve pointed out, it takes time to learn another programming language well, and it would take even more time to learn how to translate between languages well. So even ignoring all other factors, only a small subset of Julia users can practically translate your code in the near future. With Rust, you had provided Rust code first, so a much larger subset of the community could immediately share their Rust optimization knowledge with you. It’s not just a matter of curiosity.

4 Likes

Without any guarantees: it would also probably help if you provided unoptimized single threaded code from another language, or even pseudocode. It’s really hard to accurately translate highly optimized code from a foreign language…

Once we have a simple working Julia version, we can start to optimize.

6 Likes

Here’s a single-threaded version done in Crystal, whose syntax is really easy to follow.
The only difference is that it doesn’t spawn the threads but runs them serially, so it’s slower.

FYI

After watching this presentation (and some referenced links) on using CPU resources more efficiently, I was able to increase performance just by changing the sequence of a few lines of code (LOC).

The goal is to use as many CPU execution units (EUs) simultaneously as possible.
So you order sequential lines of code so that they can be executed independently.
Here’s one snippet from the Crystal code.

Original
def nextp_init(rhi, kmin, modpg, primes, resinvrs)
  nextp = Slice(UInt64).new(primes.size*2)
  r_hi, r_lo = rhi, rhi - 2
  primes.each_with_index do |prime, j|
    k = (prime - 2) // modpg 
    r = (prime - 2) %  modpg + 2 
    r_inv = resinvrs[r].to_u64   
    ri = (r_inv * r_lo - 2) % modpg + 2
    ki = k * (prime + ri) + (r * ri - 2) // modpg 
    ki < kmin ? (ki = (kmin - ki) % prime; ki = prime - ki if ki > 0) : (ki -= kmin)
    nextp[j << 1] = ki.to_u64 
    ri = (r_inv * r_hi - 2) % modpg + 2 
    ki = k * (prime + ri) + (r * ri - 2) // modpg
    ki < kmin ? (ki = (kmin - ki) % prime; ki = prime - ki if ki > 0) : (ki -= kmin)
    nextp[j << 1 | 1] = ki.to_u64
  end
  nextp
end

More Efficient/Faster
def nextp_init(rhi, kmin, modpg, primes, resinvrs)
  nextp = Slice(UInt64).new(primes.size*2)
  r_hi, r_lo = rhi, rhi - 2
  primes.each_with_index do |prime, j|
    k = (prime - 2) // modpg 
    r = (prime - 2) %  modpg + 2
    r_inv = resinvrs[r].to_u64 
    rl = (r_inv * r_lo - 2) % modpg + 2
    rh = (r_inv * r_hi - 2) % modpg + 2 
    kl = k * (prime + rl) + (r * rl - 2) // modpg
    kh = k * (prime + rh) + (r * rh - 2) // modpg
    kl < kmin ? (kl = (kmin - kl) % prime; kl = prime - kl if kl > 0) : (kl -= kmin)
    kh < kmin ? (kh = (kmin - kh) % prime; kh = prime - kh if kh > 0) : (kh -= kmin)
    nextp[j << 1] = kl.to_u64 
    nextp[j << 1 | 1] = kh.to_u64 
  end
  nextp
end

Changing the code sequence so that the next line of code isn’t dependent on the output of the previous one allows two EUs to run these instructions simultaneously. This one change in this one routine appreciably increased total runtime speed.
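To show the same idea in Julia (a toy arithmetic sketch I made up, not the sieve itself): group independent statements back to back so the CPU's execution units can overlap them, instead of consuming each result immediately after producing it.

```julia
# Dependent ordering: each result is used on the very next line,
# so the instructions form a serial chain.
function pairs_chained(k, r, rl, rh, m)
    kl = k * rl + (r * rl - 2) ÷ m
    a  = kl + 1                     # depends on kl
    kh = k * rh + (r * rh - 2) ÷ m
    b  = kh + 1                     # depends on kh
    a, b
end

# Reordered: the two independent computations are issued together,
# so two execution units can run them simultaneously.
function pairs_interleaved(k, r, rl, rh, m)
    kl = k * rl + (r * rl - 2) ÷ m
    kh = k * rh + (r * rh - 2) ÷ m  # independent of kl
    a  = kl + 1
    b  = kh + 1                     # independent of a
    a, b
end
```

Both orderings compute identical values; only the dependency structure visible to the CPU changes.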

Here’s another snippet example from the same program.

Original
    primes.each_with_index do |prime, j| 
      k = nextp.to_unsafe[j << 1]        
      while k < kn                       
        seg[k >> s] |= 1_u64 << (k & bmask)
        k += prime  end                  
      nextp.to_unsafe[j << 1] = k - kn   
                                         
      k = nextp.to_unsafe[j << 1 | 1]    
      while k < kn                       
        seg[k >> s] |= 1_u64 << (k & bmask)
        k += prime  end                   
      nextp.to_unsafe[j << 1 | 1] = k - kn
    end                                   

More Efficient/Faster
    primes.each_with_index do |prime, j|
      kl = nextp.to_unsafe[j << 1]      
      kh = nextp.to_unsafe[j << 1 | 1]  
                                        
      while kl < kn                     
        seg[kl >> s] |= 1_u64 << (kl & bmask)
        kl += prime  end
                        
      while kh < kn     
        seg[kh >> s] |= 1_u64 << (kh & bmask)
        kh += prime  end                
        
      nextp.to_unsafe[j << 1] = kl - kn 
      nextp.to_unsafe[j << 1 | 1] = kh - kn 
    end 
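In case it helps with a Julia port: here’s my rough Julia rendering of the reordered marking loop, assuming `seg` is a bit array packed into a `Vector{UInt64}` with `s = 6` and `bmask = 63` as in the Crystal code. The function name and the 1-based indexing adjustments are mine.

```julia
# Mark composites in the current segment for each prime's two residue
# tracks, then store the carried-over offsets back into `nextp`.
function mark_segment!(seg::Vector{UInt64}, nextp::Vector{UInt64},
                       primes::Vector{UInt64}, kn::UInt64)
    s, bmask = 6, UInt64(63)
    @inbounds for (j, prime) in enumerate(primes)
        i = 2j - 1                 # Julia arrays are 1-based
        kl = nextp[i]              # load both tracks up front,
        kh = nextp[i + 1]          # independent of each other

        while kl < kn
            seg[(kl >> s) + 1] |= UInt64(1) << (kl & bmask)
            kl += prime
        end

        while kh < kn
            seg[(kh >> s) + 1] |= UInt64(1) << (kh & bmask)
            kh += prime
        end

        nextp[i]     = kl - kn     # offsets into the next segment
        nextp[i + 1] = kh - kn
    end
    seg
end
```

As in the Crystal version, the two loads happen before either while loop, so the two marking loops have no dependency on each other.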

I’ve also seen the gcc-compiled versions (C++, Nim) show a bigger speed improvement (execution-time reduction) than the llvm-compiled ones, except for Rust, which also sees a significant improvement.

I don’t know how applicable this is to Julia, but I’m in the process of updating all the language versions of my code to get these FREE speedups by just writing the code better.

I’m not willing to wade through the PDF (I despise PDF; it was obsolete two decades ago, and in a paperless world who needs a page description language?) or the bits of code I saw.

This Julia program will keep finding prime numbers until you stop it, and when run again it will pick up where it left off. The program pretty much only uses integer addition and arithmetic comparison. I doubt I’m the first to implement it. I don’t know if it has a name, but I think of it as an inside-out Sieve of Eratosthenes.

I don’t run it very often, but so far it has found all primes up to 35249497.

You were asking about prime number algorithms in Julia so I thought I’d speak up.

This is probably the first Julia program I ever wrote.

1 Like

There are around 200 implementations on Rosetta Code alone

https://rosettacode.org/wiki/Sieve_of_Eratosthenes

And a CSP version is illustrated in Hoare’s “Communicating Sequential Processes,” Communications of the ACM 21(8) (August 1978), 666-677.

The code is a prime (ha!) example of how one can use Channels to make better code.

https://swtch.com/~rsc/thread
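As a concrete Julia illustration (my sketch, not code from Hoare’s paper): the pipeline of filter processes maps naturally onto Julia `Channel`s, with one filter task per prime found so far.

```julia
# CSP-style sieve: a counter task feeds a chain of filter tasks,
# each of which removes the multiples of one prime.
function csp_primes(n)
    primes = Int[]
    ch = Channel{Int}() do c           # source: the integers 2..n
        foreach(i -> put!(c, i), 2:n)
    end
    next = iterate(ch)                 # nothing once a channel drains
    while next !== nothing
        p = next[1]                    # smallest survivor is prime
        push!(primes, p)
        prev = ch
        ch = Channel{Int}() do c       # append a filter stage for p
            for i in prev
                i % p != 0 && put!(c, i)
            end
        end
        next = iterate(ch)
    end
    primes
end
```

Each `Channel` constructor spawns a task, so the chain really is a set of communicating sequential processes; it is elegant rather than fast.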

2 Likes

The algorithm I used is not the Sieve of Eratosthenes, which requires that an upper bound be specified for the primes one is looking for.

I’ve not seen or heard about the approach I took. It is pretty simple though, so I expect it’s been done before.

It counts up the positive integers. It maintains a list of what I call sieve objects (because I couldn’t think of a better name), one for each prime it has found so far. Each of these holds one prime number and its greatest multiple that is less than the integer under current consideration.

The current integer is tested against each sieve to see whether it equals that sieve’s next multiple. If no sieve matches, the integer is a new prime, a new sieve object is created for it, and we proceed to the next integer. I don’t recall the edge cases, but IIRC it doesn’t need to test every sieve against every integer.

I think I first implemented this in Dylan (probably my only Dylan program), but that version held the sieve objects in a priority queue ordered by next multiple. I restructured the algorithm to not need a priority queue in the Julia version.
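If I’ve read your description right, a minimal Julia sketch of it might look like this (the struct and all names are my guesses, and I’ve skipped both the priority-queue variant and the skip-some-sieves optimization):

```julia
# One entry per prime found so far, tracking its multiples.
mutable struct SieveEntry
    prime::Int
    multiple::Int   # greatest multiple of `prime` generated so far
end

function incremental_primes(limit)
    primes  = Int[]
    entries = SieveEntry[]
    for n in 2:limit
        composite = false
        for e in entries
            while e.multiple < n      # advance this sieve up to n
                e.multiple += e.prime
            end
            if e.multiple == n        # n is a multiple of e.prime
                composite = true
            end
        end
        if !composite                 # no sieve claimed n: new prime
            push!(primes, n)
            push!(entries, SieveEntry(n, n))
        end
    end
    primes
end
```

Only integer addition and comparison are needed in the inner loops, which matches your description.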

Maybe one of you knows what this algorithm is called.

I don’t think the “upper bound” thing is the defining characteristic of SoE. Your algorithm is what Wikipedia/sieve_of_eratosthenes calls the “incremental sieve”.

1 Like

Great. Thanks Gustaphe.