Error when using "too many" workers


#1

Hi, I have been running code in parallel using julia -p200 for a while. Now I am using a cloud server that has:

CPU(s)                        96
Cores Per Socket        24
Sockets                       2

By my calculations I should be able to run 96x24x2=4608 threads. However when trying to run something like julia -p750 (I can run julia -p500) for example I get the following error:

brett_israelsen@instance-1:~/GitProjects/self_confidence/road_net$ julia -p4600
ERROR (unhandled task failure): pipe_link: too many open files (EMFILE)
Stacktrace:
 [1] setup_stdio(::Base.##374#375{Cmd}, ::Tuple{Base.DevNullStream,Pipe,Base.TTY}) at ./process.jl:497
 [2] #spawn#373(::Nullable{Base.ProcessChain}, ::Function, ::Cmd, ::Tuple{Base.DevNullStream,Pipe,Base.TTY}) at ./process.jl:511
 [3] (::Base.#kw##spawn)(::Array{Any,1}, ::Base.#spawn, ::Cmd, ::Tuple{Base.DevNullStream,Pipe,Base.TTY}) at ./<missing>:0
 [4] #spawn#370(::Nullable{Base.ProcessChain}, ::Function, ::Base.CmdRedirect, ::Tuple{Base.DevNullStream,Pipe,Base.TTY}) at ./process.jl:392
 [5] spawn(::Base.CmdRedirect, ::Tuple{Base.DevNullStream,Pipe,Base.TTY}) at ./process.jl:392
 [6] (::Base.Distributed.##31#34{Base.Distributed.LocalManager,Dict{Any,Any},Array{WorkerConfig,1},Condition})() at ./event.jl:73
Master process (id 1) could not connect within 60.0 seconds.                                                    
exiting.
Worker 465 terminated.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
Stacktrace:
 [1] (::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at ./event.jl:73
Worker 464 terminated.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
Stacktrace:
 [1] (::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at ./event.jl:73
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
.
.
.

I suppose that since I am fairly new to HPC, I may have unrealistic expectations, or could be doing something wrong. Hopefully someone can help me out, I would like to be able to use all of the available threads and am not sure how, or if I am already. Thanks!


#2

IIRC you need to use addprocs instead and set the topology to be something more sparse. @sbromberger might know.


#3

Judging from the error message you have a ulimit of 512 (check with ulimit -a). See here for how to increase it (no clue if that works in your environment though).


#4

Yup; I ran into this also. The default topology is “all-to-all”, which means, for N workers, you’re opening up N^2 files / sockets / whatever you’re using to communicate . If your workers don’t need to communicate with each other, consider using :master_slave instead (via addprocs(N, topology=:master_slave)).

IMO we should reconsider having :all_to_all as a default topology. I seem to recall a github issue on it.


#5

For the sake of leaving a trail for future readers, this is the output of ulimit -a. I can’t really see the 512 you mentioned (unless is is the pipe size, which I doubt). However there is a file limit of 1024, which might be related from the too many open files error.

brett_israelsen@instance-1:~/GitProjects/self_confidence/road_net$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 347860
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 347860
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Looks like I can manually increase the limit for open files via ulimit -n NEW_LIMIT, but according to @sbromberger using :master_slave might be the only practical solution based on the N^2 relationship.


#6

So, does addprocs(N, topology=:master_slave) work in the same way as julia -p N? besides replacing the default :all_to_all mode?


#7

Ah, ok. The 512 was just a guess, but I didn’t know about the N^2 relationship either. The relevant limit might be

pending signals                 (-i) 347860

then considering that 500^2=250000 < 347860 < 750^2 = 562500.


#8

That’s my understanding.


#10

Most cloud servers will have either two or four CPUs (chips). Each CPU chip resides in a single “socket” and has some number of cores on the chip. The total number of cores is therefore the number of CPUs times the number of cores per CPU. It sounds like you have two CPUs (sockets) with 24 cores per CPU, for a total of 48 cores; this would be a common, realistic, albeit somewhat high-end configuration for a cloud server.

For most types of CPUs (and, in particular, Intel Xeon CPUs) can support Hyperthreading. Hyperthreading allows each core to execute two threads (programs or different parts of the same program) at once, interleaved with each other. Hyperthreading usually speeds up performance, but there are some cases where it makes performance worse. On the system you describe, I suspect you have a total of 96 hypertheads on the 48 cores, and this is where the number “CPU(s) 96” comes from in your post above. Sometimes hyperthreads get referred to as “cores”, although this is not technically accurate; Amazon EC2 refers to hyperthreads as “vCPUs”.

So I suspect that what you really want to do is run “julia -p96”. Anything more may overload the cores with too many workers/threads, which will cause performance to suffer as each core attempts to switch between many tasks, which causes a large loss of efficiency due to the overhead of doing the switches.

One exception to what I say above is if you are using multiple clouds servers (nodes) together with MPI and a machinefile: in this case, you’ll have more cores available in total, but the different cloud servers will have to communicate with each other across the network, which adds some overhead. You will need to run your Julia code differently if this is the case; let us know if it is the case and if you’d like details.

Another exception to what I say above is if the cloud server you are using has one or more Intel Xeon Phi accelerator card in it (Knights Corner / Knight’s Landing). Having such a card would be very unusual, but can provide more cores on a single machine than would otherwise be possible.

I hope this helps.


#11

I believe this is the issue you are referring to.


#12

That’s the one, and here’s some additional discussion.. My plea for backporting to 0.6 fell on deaf (or unsympathetic) ears.