Parallel loop in numeric order

Running a loop is in numeric order
for i=1:100

Is it possible to run a parallel loop in order?
@sync @distributed for i=1:100

I figure it is not fully possible to fully run it in order, but at least batchwise. Say I have 4 workers; then something like


… in any order

should be somehow possible when workers are synchronized, or?

what’s the goal here? if your loops affect each other, then you shouldn’t run them in parallel, if you just want the results to be ordered, you can order them later (e.g put result into per-allocated variable) or pmap


I want to find the lowest n for some condition to be true. So naturally I would do a loop/while loop for n=1,2,3,… and break/return when i found my n. In principle each check for given n is independent, so parallelizing would make it much more efficient. However, when looping up to some largest N in parallel the checking is not done in order, so I would have to do all checks from 1 to N and then look for the lowest n which is not optimal in my opinion. I would have thought that with @sync this is somehow possible.

There is inevitably some overhead in parallel programs. It might be just to synchronize the workers, might be something more complicated, like taking a few values from workers and reducing them somehow (summing, taking extrema etc.)

In your case, the strategy to reduce the overhead depends on what you know of the problem.
If your predicate is false for all n < n₀ and true for all n >= n₀, then checks made in strides should be a good choice: process 1 checks numbers 1, p+1, 2p+1 etc., process 2 checks 2, p+2, …, where p is the number of processes.
If the predicate is only true for some random numbers and there may be no assumptions how many such numbers there are and how they are distributed, then there’s not much room for optimization.

Hey and thanks. So I just thought of something like this

@everywhere function test(x,n)

test(x,n) gives some boolean (just true or false) for integer n and input x. So is it possible to do it like this?

while true
@spawn r1=test(x,n+1)
@spawn r2=test(x,n+2)
@spawn r3=test(x,n+3)
@spawn r4=test(x,n+4)

if ( fetch(r1) || fetch(r2) || fetch(r3) || fetch(r4) )
return blablabla

I’m not very familiar with this spawn/fetch thing, but I figured it should be possible this way, or?

1 Like

It is technically possible, but that would (probably) be inefficient.
You need some significant work done on a thread / worker before any synchronization, otherwise spawning and synchronization will eliminate the potential gains from parallelism.
I’d go like that:

@everywhere function test(x, n)

@everywhere function test_range(x, r)
    for n in r
        if test(x, n)
            return n
    return missing

function world_min(x, nmax)
    work_procs = workers()

    work_tasks = [@spawnat(work_procs[i], test_range($x, $i:$work_procs:$nmax))
                  for i in eachindex(work_procs)]

    min_n_on_workers = fetch.(work_tasks)

    return minimum(n for n in min_n_on_workers if n !== missing)

Still thinking about your code. So you are doing it up to some nmax? Does it mean it is calculating up to nmax and then checking without prior abortion when n is found?

What does



Do you mean



Workers check non-overlapping ranges with the step equal to nworkers() (that is what must be in place of $work_procs in the comprehension expression), up to some nmax. Each worker returns a value as soon as it has found one (this is what test_range function does). The master process’ task is to fetch all those values and take the minimum of them.

In principle, there may be no prior knowledge of nmax, but then it can be set to typemax(Int) or something, or the computation may be done repeatedly for smaller consecutive ranges. Your code in the 4th message is an extreme case when those consecutive ranges have the length nworkers().

The thing to consider is how to make benefits from parallel execution outweigh the overhead.
Each worker has to

  • receive its task (a few 1000s of clock cycles)
  • make its part of shared work
  • send the result to the receiver (another few 1000s of clock cycles)

So, the workload must be “large”, otherwise most of the time of the parallel execution will be spent on sending data between processes.

1 Like