I am working on a fairly big code base written in Julia 1.1 and I am experiencing erratic issues with parallel processing. It is very annoying as most of the time everything works perfectly.
The application mainly consists of a long list of tasks distributed with pmap; some of them use distributed aggregation similar to this:
result = @distributed (+) for task_id in task_ids
    # ... (per-task computation elided) ...
end
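A self-contained sketch of that overall shape (the function name, task count, and loop body are mine, not from the actual code base): pmap farms out tasks, and each task performs its own (+) reduction across the workers.

```julia
using Distributed
addprocs(2)  # assumption: a couple of local workers, as in the original setup

# Hypothetical stand-in for one of the tasks: each task runs its own
# (+) aggregation across the workers, mirroring the snippet above.
@everywhere function run_task(task_id)
    @distributed (+) for i in 1:100
        i * task_id
    end
end

# The task list itself is distributed with pmap; some tasks aggregate internally.
results = pmap(run_task, 1:4)
```

Note that this nests @distributed inside pmap'd tasks, which relies on the default all-to-all topology so that workers can reach each other — consistent with the peer-to-peer connection errors below.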
Just once in a great many calculations, I get this error from the processes that use the aggregation code above:
peer 36 didn’t connect to 27 within 180.0 seconds
peer 36 didn’t connect to 28 within 180.0 seconds
peer 36 didn’t connect to 13 within 180.0 seconds
peer 36 didn’t connect to 29 within 180.0 seconds
…and so on. You might have noticed that I have already extended JULIA_WORKER_TIMEOUT to accommodate longer execution times. That did not help much. All the processes run on the same machine.
If I resubmit the failing task, it gets processed correctly. I know it is a long shot with the little information given, but any idea or recommendation on how to investigate, or where to look, would be very much appreciated. Thanks!
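For reference, the timeout mentioned above is controlled by an environment variable that the worker processes read at startup; raising it from the 60-second default to the 180 seconds seen in the errors presumably looked something like this (the script name is a placeholder):

```shell
# Must be set before the julia process (and therefore its workers) starts.
export JULIA_WORKER_TIMEOUT=180

# Placeholder invocation: 36 local worker processes.
julia -p 36 my_script.jl
```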
How many workers do you have connected to a single master at one time?
36 workers for a single master on a 24-core machine
What topology are you using (if not using the default)?
Is there high network congestion/latency between the master and workers (or between workers)?
That can be the case, as many workers perform read/write disk access
Are some/all of your workers doing other (Julia) things in the same process when one of these @distributed calls is made?
I am not sure I understand the question correctly. All the workers work independently, so workers A and B can each be busy with a task (perhaps both inside @distributed calls) while C, D, etc. are idle or handling different tasks.
Since your workers run on the same node as the master (instead of connecting over the network), this is probably not the problem. Although, if the workers do connect over the network for some reason, then a sufficiently high amount of interface traffic could cause enough dropped packets to produce a failed connection. You should double-check (somehow?) that the master and workers connect to each other without traversing the physical network.
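One rough way to at least see which address each process reports (a sanity check only — it shows the interface each process sees, not the actual socket route):

```julia
using Distributed, Sockets

# Ask every process which IP it reports; on a single machine these should
# all be local addresses of the same host.
for p in procs()
    addr = remotecall_fetch(getipaddr, p)
    println("process $p reports address $addr")
end
```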
My point is that, assuming the error is not caused by a failed connection at the OS network layer (which handles connection acknowledgements automatically, so timeouts are unlikely on a non-congested network), it may be that a Julia process is intently working on something else (say, a huge matrix-matrix multiply) and never gets the chance to yield and answer the message from the master.
In this case, it would help to get a better idea of the sorts of code your workers are running. Are they large mathematical operations without I/O (which won’t yield)? Long-running code that doesn’t do any I/O never calls yield, which starves the worker of the chance to service requests from the master (since all of this runs on a single thread).
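A toy, single-process illustration of that starvation (entirely my own construction): a queued task — standing in for "answering the master" — cannot run until the current task reaches a yield point.

```julia
order = String[]

function busy(n)
    s = 0.0
    for i in 1:n
        s += sqrt(i)   # pure computation: no I/O, no yield points
    end
    s
end

# Schedule a task that stands in for servicing a request; on a single
# thread it can only run once the current task yields.
t = @async push!(order, "request serviced")

busy(10^7)                       # never yields while it runs
push!(order, "computation done")
yield()                          # first yield point: the async task runs only now
```

The "request" is serviced only after the computation finishes, even though it was scheduled first — the same effect, scaled up past the timeout, could explain the connection errors.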
I have been able to simplify my code to something human-readable and forum-friendly:
using Distributed
println("worker timeout $(Distributed.worker_timeout())") # default 60 secs

@everywhere function job_worker(some_number)
    myid() == 3 && sleep(120)  # blocking stand-in for the intensive part (duration is a guess)
    rand(1_000_000, 1) * some_number
end

@everywhere function aggregate(job_id)
    @distributed (+) for some_number in collect(1:10)
        job_worker(some_number)
    end
end
I tried to replace the intensive part of my code with something completely blocking (see the sleep call), but this was not enough to reproduce the issue.
Maybe I should look somewhere else, or maybe this is not the right way to replicate it.
So, I would switch out those println and sleep calls for large operations like matrix multiplies (and probably pre-allocate the arrays). Both println and sleep yield to the scheduler, so they’re not equivalent to a long-running, non-yielding computation that can starve Julia’s scheduler.
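Concretely, the repro's job_worker could swap the sleep for something like this (matrix sizes and iteration counts are my guesses, to be tuned so one worker grinds longer than the worker timeout):

```julia
using Distributed
@everywhere using LinearAlgebra

@everywhere function job_worker(some_number)
    A, B = rand(2_000, 2_000), rand(2_000, 2_000)
    C = A * B                      # allocate the result buffer once
    n = myid() == 3 ? 200 : 1      # one worker grinds far longer than the others
    for _ in 1:n
        mul!(C, A, B)              # in-place BLAS multiply: no I/O, never yields
    end
    sum(C) * some_number
end
```

Unlike sleep, the mul! loop keeps the worker's single thread pinned in a BLAS call with no yield points, which is the scenario described above.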