JULIA_WORKER_TIMEOUT error

I am working on a fairly big code base written in Julia 1.1, and I am experiencing erratic issues with parallel processing. It is very annoying, as most of the time everything works perfectly.
The application is mainly composed of a long list of tasks distributed with pmap; some of them use distributed aggregation similar to this:

result = @distributed (+) for task_id in task_ids
    aggregate_worker(task_id)
end

Just once in a great many calculations, I get this error from the processes that use the aggregation code above:

peer 36 didn't connect to 27 within 180.0 seconds
peer 36 didn't connect to 28 within 180.0 seconds
peer 36 didn't connect to 13 within 180.0 seconds
peer 36 didn't connect to 29 within 180.0 seconds

…and so on. You might have noticed that I have already extended JULIA_WORKER_TIMEOUT to accommodate longer execution times. That did not help much. All the processes run on the same machine.
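
For context, this is roughly how I raise the timeout (a minimal sketch; setting it in the master before addprocs works here because local workers inherit the master's environment):

ENV["JULIA_WORKER_TIMEOUT"] = "180"   # default is 60.0 seconds

using Distributed
addprocs(36)   # local workers on the same machine inherit the setting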

If I resubmit the failing task, it gets processed correctly. I know it is a long shot with the little information given, but any idea or recommendation on how to investigate or where to look would be much appreciated. Thanks!

A few questions:

  • How many workers do you have connected to a single master at one time?
  • What topology are you using (if not using the default)?
  • Is there high network congestion/latency between the master and workers (or between workers)?
  • Are some/all of your workers doing other (Julia) things in the same process when one of these @distributed calls is made?

@jpsamaroo thanks a lot for helping out.

I will go in order:

  • How many workers do you have connected to a single master at one time?
    36 workers for a single master on a 24-core machine

  • What topology are you using (if not using the default)?
    default topology

  • Is there high network congestion/latency between the master and workers (or between workers)?
    That can be the case, as many workers do read/write disk access

  • Are some/all of your workers doing other (Julia) things in the same process when one of these @distributed calls is made?
    I am not sure I understand the question correctly. However, all the workers work independently, so workers A and B can each be busy with one of the tasks (perhaps both inside the @distributed calls) while C, D, etc. can be idle or fulfilling different tasks, e.g.:

pmap(tasks[1,2,3])
→ A(task1)
  → @distributed
→ B(task2)
  → @distributed
→ C(do nothing)
→ D(task3)
→ E(do nothing)
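
In code, the structure is roughly this (a sketch only; n_subtasks is a made-up helper, not from the real code base):

results = pmap(task_ids) do task_id
    # each pmap task runs on one worker and internally launches a
    # @distributed reduction that spans all workers
    @distributed (+) for sub_id in 1:n_subtasks(task_id)
        aggregate_worker(sub_id)
    end
end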

Thanks again

Since you end up connecting to the same node as the master (instead of going over the network), this is probably not the problem. Although, if the workers do connect over the network for some reason, a sufficiently high amount of interface traffic could drop enough packets to make connections fail. You should double-check (somehow?) that the master and workers connect to each other without traversing the physical network.

My point is that, assuming the error is not generated by a failed connection at the OS network layer (which automatically acknowledges an accepted connection, preventing timeouts on a non-congested network), it may be that a Julia process is working so intently on something else (say, a huge matrix-matrix multiply) that it never gets the chance to yield and answer the message from the master.

In this case, it would help to get a better idea of the sorts of code your workers are running. Are they large mathematical operations without I/O (which won't yield)? Long-running code which doesn't do any I/O never calls yield, which starves the worker of the chance to service requests from the master (since all of this runs on a single thread).
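
As a toy illustration (not from your code base), compare a function full of yield points with one that has none:

function yields_often()
    for i in 1:10
        println(i)   # I/O goes through libuv and implicitly yields
        sleep(0.1)   # sleep also yields to the scheduler
    end
end

function never_yields(n)
    s = 0.0
    for i in 1:n     # pure arithmetic has no yield points, so message-
        s += i * i   # handling tasks on this worker never get scheduled
    end
    return s
end

While never_yields runs, the worker cannot even accept a peer connection, which is exactly what the timeout message reports.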

@jpsamaroo

On your first point, the master and workers sit on the same machine, so I agree the network is unlikely to be the problem.

On your second point, I confirm there is some intense calculation going on there. I will try to create a mockup of the code so that you can take a look.
Thanks a lot!

Hello,
I have been able to simplify my code to something human-readable and forum-friendly :slight_smile:

using Distributed
addprocs(5)

println("worker timeout $(Distributed.worker_timeout())") #default 60 secs

@everywhere function job_worker(some_number)
    if myid() == 3
        println("block!")
        sleep(3000)
    end
    rand(1_000_000, 1) * some_number
end

@everywhere function aggregate(job_id)
    @distributed (+) for some_number in collect(1:10)
        job_worker(some_number)
    end
end

function do_job(jobs_list) 
    pmap(aggregate, jobs_list)
end

do_job(collect(1:1000))

I tried to replace the intensive part of my code with something completely blocking (see the sleep call); however, this is not enough to reproduce the issue.
Maybe I should look somewhere else, or maybe this is not the right way to replicate it :thinking:

So, I would switch out those println and sleep calls for large operations like matrix multiplies (and probably pre-allocate the arrays). Both println and sleep yield to the scheduler, so they're not equivalent to a long-running, non-yielding computation that would starve Julia's scheduler.
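
For example, something like this (a sketch of the idea only; the matrix sizes and loop count are arbitrary):

@everywhere using LinearAlgebra

@everywhere function job_worker(some_number)
    if myid() == 3
        A = rand(2_000, 2_000)   # pre-allocate the work arrays once
        B = rand(2_000, 2_000)
        C = similar(A)
        for _ in 1:500
            mul!(C, A, B)        # heavy in-place multiply; no yield points
        end
    end
    rand(1_000_000, 1) * some_number
end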

@jpsamaroo

I have been able to replicate the issue with the following code:

@everywhere function job_worker(some_number)
    while myid() == 3
        i = some_number * some_number / some_number
    end

    rand(1_000_000, 1) * some_number
end

Basically, it is just an infinite loop with a simple operation in it. This clearly points me to the issue: I will refactor the operations my application executes so that they do not starve the workers.
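
Something along these lines is what I have in mind (a sketch only; the chunking and sizes are made up):

@everywhere function job_worker(some_number)
    acc = 0.0
    for chunk in 1:1_000
        for i in 1:1_000_000
            acc += some_number * i   # one slice of the heavy computation
        end
        yield()   # give the scheduler a chance to service peer connections
    end
    rand(1_000_000, 1) * some_number
end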

Thanks a lot for your help :slight_smile:
