addprocs(["r4i6n10.icex.cluster";"r5i0n9.icex.cluster"],tunnel=true,topology=:all_to_all)
ERROR: connect: connection refused (ECONNREFUSED)
in yieldto(::Task, ::ANY) at ./event.jl:136
in wait() at ./event.jl:169
in wait(::Condition) at ./event.jl:27
in stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N}) at ./stream.jl:44
in wait_connected(::TCPSocket) at ./stream.jl:265
in connect at ./stream.jl:960 [inlined]
in connect_to_worker(::SubString{String}, ::Int16) at ./managers.jl:483
in connect_w2w(::Int64, ::WorkerConfig) at ./managers.jl:446
in connect(::Base.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./managers.jl:380
in connect_to_peer(::Base.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./multi.jl:1479
in (::Base.##637#639)() at ./task.jl:360
Error [connect: connection refused (ECONNREFUSED)] on 3 while connecting to peer 2. Exiting.
Worker 3 terminated.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
Ramin, I see an SGI ICE cluster and I claim my five pounds.
Rack 5 enclosure 0 node 9
A stupid question - can you log into r4i6n10 and, from there, ssh into r5i0n9?
If you are using Slurm there is a PAM module which stops you sshing into a node if you are not running a job.
But that will not stop you sshing between nodes. So this is irrelevant.
I’ve seen this before when the nodes run out of connections. The issue with all_to_all connectivity is that on the order of n^2 connections are made, as every node is connected to every other node.
If you don’t need your workers communicating with each other directly, consider the master-worker topology (topology=:master_worker) instead.
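For reference, in Julia that would look something like the following (a sketch reusing the hostnames from the failing call above; :master_worker is the symbol the topology keyword accepts):

using Distributed   # needed on Julia 0.7+; on older versions addprocs is in Base
# Same addprocs call, but with a master-worker topology: the only connections
# made are driver <-> worker, with no worker <-> worker links.
addprocs(["r4i6n10.icex.cluster", "r5i0n9.icex.cluster"];
         tunnel = true,
         topology = :master_worker)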
The answer to the ssh question is yes, I can do that.
However, you are onto something. Upon further investigation yesterday, I found that the error only happens if I add nodes on different racks with :all_to_all. I can add as many connections as I want on the same rack (so I don’t think it is a number-of-connections issue).
In fact, the error happens after the ssh succeeds: I see output indicating that the ssh connection was established.
I unfortunately need all_to_all for my application (plenty of communication between procs). Please see answer below on why I don’t think it is the number of connections that is the culprit.
Ramin, remember the network topology of SGI systems.
The Infiniband network goes onto all compute nodes of course.
(For anyone not familiar with these systems - there are no managed switches. Each ‘IRU’ or blade chassis has two or four dumb switch blades. These are connected into a hypercube.)
The Ethernet connections are taken, within one rack, from the rack leader node at the top of the rack (probably two rack leaders, as you have an ICEX).
I THINK that you may be seeing this - the Ethernet connections are not linked between the racks in your setup.
(Then again - I thought there should be a connection linking each rack, and in the setups I used to manage I am sure there was.)
Anyway, you should maybe be using the InfiniBand ib0 interface names, i.e. the hostnames associated with the IP-over-IB interfaces called ib0 on each node.
I cannot log into an ICE cluster, but I am pretty sure there is a simple mapping between the host names.
I apologise if I am setting you off on the wrong path.
Rami
I was wrong.
I am correct in saying that the Gigabit Ethernet is only available internal to the rack. This is the behaviour you see.
However, the r4i6n10 hostname should be the one for the InfiniBand ib0 interface.
Can you run an ‘ip addr’ on one of those nodes please?
Somehow I think you are getting the Ethernet interfaces and not the ib0 - which you DO want!
The GBE VLAN is entirely internal to each rack (see Figure 1-7). The naming scheme is replicated between each rack, so the name i2n4-eth (identifying the VLAN_GBE interface on IRU 2, node 4) may match several different nodes, but only ever one in each rack. To identify a node uniquely, use the rXiYnZ syntax.
Blade rXiYnZ names are resolvable via DNS. They get the A record for the -ib0 address. The rXiYnZ-ib0 name is a CNAME to the rXiYnZ address.
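For anyone who would rather run that ‘ip addr’ check from the Julia REPL than from a shell, something like this works (a sketch, assuming passwordless ssh from the login node to the compute node):

run(`ssh r4i6n10 ip addr show ib0`)   # show the IP-over-InfiniBand interface on that node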
Sorry to be hijacking this thread.
What you are doing should be correct. However you might have to use the -ib0 hostnames. See:
InfiniBand Network
The InfiniBand fabric is connected to service nodes, rack leader controllers (leader nodes), and compute nodes, but not to the system admin controller (admin node) or CMCs. Table 1-7 shows InfiniBand names. There are two IB connections to each of the nodes that use it. Since IB is not local to each rack, you must use the fully-qualified, system-unique node name when specifying a destination interface. It may be necessary to alias the rXiYnZ names (currently non-resolvable) to rXiYnZ-ib0 if this is needed by MPI. Technically, rXiYnZ from a leader node points at the VLAN_GBE interface for the compute blade while from a service or compute blade, rXiYnZ points to the ib0 interface.
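In Julia terms, that suggestion amounts to pointing addprocs at the IP-over-IB hostnames rather than the rack-local GbE names. A sketch, with the -ib0 names guessed from the rXiYnZ-ib0 scheme quoted above (worth checking with nslookup that they actually resolve):

# Same call as the original, but using the (assumed) IPoIB hostnames.
addprocs(["r4i6n10-ib0", "r5i0n9-ib0"];
         tunnel = true,
         topology = :all_to_all)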
Rami,
Would you please do something quite simple?
On the cluster login node: nslookup r4i6n10 (or use the ‘dig’ tool)
Log into r5i0n9 and run the same command. Do you get the SAME IP address returned?
I wish I had access to an ICE cluster, but I think you may have the InfiniBand interface being returned in one case.
But when you ssh into the compute node you are getting the ethernet address returned.
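The same resolution check can also be done from inside Julia on each node (a sketch; on Julia 0.x getaddrinfo is available from Base, while on 1.0+ it lives in the Sockets stdlib):

using Sockets   # Julia 1.0+ only; omit on 0.x
# Compare what the two names resolve to on the login node and on r5i0n9.
println(getaddrinfo("r4i6n10"))
println(getaddrinfo("r5i0n9"))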
SGI support are excellent, and you should involve them. I realise this is HPE support nowadays!
We would be interested in learning more about your application, and the all-to-all communication.
I would be interested to find out how it runs on an SGI cluster when we do get it to run!
As an aside topic, there is a very useful Python package called python-hostlist which is used for cases like this where you might want to ‘translate’ hostnames between nodename and nodename-ib0 or to expand lists of hostnames. https://www.nsc.liu.se/~kent/python-hostlist/
Note to self - perhaps a Julia equivalent might be a good project.
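As a starting point, the hostname-translation half of that is a one-liner in Julia (a minimal sketch; the -ib0 suffix convention is the one from this thread, and to_ib0 is just a made-up name):

# Map rack-local node names to their IP-over-IB equivalents,
# e.g. "r4i6n10" -> "r4i6n10-ib0", leaving already-suffixed names alone.
to_ib0(host::AbstractString) = endswith(host, "-ib0") ? String(host) : host * "-ib0"
to_ib0.(["r4i6n10", "r5i0n9"])   # ["r4i6n10-ib0", "r5i0n9-ib0"]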
The application is a domain decomposition for a finite-difference code. Strictly speaking, the connectivity does not need to be :all_to_all, since only the procs that share a boundary need to communicate. However, the book-keeping gets tedious when trying to construct custom topologies for that.
Right now it is not scaling all that well, but I am not really an expert on domain decomposition; I have a few implementations and am testing how they scale. I managed to figure out how to get bsub to reserve nodes on one rack - a shameless hack that works for a few nodes, but I am hoping for a more permanent solution.
If I understood the documentation correctly, the ssh cluster manager uses TCP connections between the workers and cannot take advantage of the fast InfiniBand interconnect (unless a cluster manager with a custom transport is defined). I wrote this little piece of code to show colleagues how you might go about writing a domain decomposition in Julia.
I have a question for you: is my understanding correct? And if so, how hard would it be to write such a manager?
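One side note on the ‘only boundary-sharing procs need to communicate’ point: newer Julia versions (0.7 and later, where this lives in the Distributed stdlib) added a lazy keyword for the :all_to_all topology, so worker-worker connections are only opened on the first remote call between a given pair. A sketch, assuming a recent Julia:

using Distributed
# Logical all-to-all, but connections between workers are deferred until first
# use, so only procs that actually exchange boundary data ever connect.
addprocs(["r4i6n10.icex.cluster", "r5i0n9.icex.cluster"];
         tunnel = true,
         topology = :all_to_all,
         lazy = true)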
Rami, that is not a shameless hack by any means. In fact, people often do that - if you run a parallel code on servers which are on the same switch, or are ‘close’ to each other in network terms, you get better performance. I applied this on the ICE clusters I managed in Formula 1. We ran one job contained in an IRU; in PBS this is known as ‘bladesets’.
I also implemented it on a large NUMA machine - using the scheduler to place the processes close to each other.
I agree that you will want to run on more than one rack.
Well… you are using Infiniband I hope - just not native Infiniband.
To explain, an InfiniBand network is set up and runs on the cluster. The IP-over-InfiniBand (IPoIB) protocol is used to create IP interfaces connected to the InfiniBand cards - these are the interfaces which you see named ‘ib0’.
It is not necessary to have IPoIB set up, or the addresses assigned, for InfiniBand itself to work.
However, if the communication goes over ssh, then yes, you would need it!
It is also worth borrowing some terminology from the MPI world (I am no MPI guru I must add).
You can have a ‘launcher’, which starts all the processes on the remote machines. This can be ssh based.
The actual interprocess communication can use a variety of transport layers - BTLs in the jargon of OpenMPI.
These can be shared memory segments (on a node), TCP, or native Infiniband or native Omnipath transports.
Would a Julia expert care to comment here, please, and help me dig myself out of the many holes I just dug in parallel?
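To at least sketch an answer to the cluster-manager question: a custom-transport manager has to implement the launch / manage / connect hooks described in the Julia manual section on cluster managers with custom transports. Below is the bare shape of such a manager, with the InfiniBand transport itself left as a stub - an outline of the documented interface, not a working implementation, and IBManager is just a made-up name. (On Julia 0.x the same functions live in Base rather than Distributed.)

using Distributed

struct IBManager <: ClusterManager   # hypothetical custom-transport manager
    nodes::Vector{String}
end

function Distributed.launch(m::IBManager, params::Dict, launched::Array, c::Condition)
    # A real manager starts a Julia worker on each node here (e.g. over ssh),
    # push!es a filled-in WorkerConfig for each onto `launched`, and calls
    # notify(c) so that addprocs can proceed.
    error("worker launch not implemented in this sketch")
end

function Distributed.manage(m::IBManager, id::Integer, config::WorkerConfig, op::Symbol)
    # Called with op = :register, :interrupt, :deregister or :finalize.
end

# The custom-transport hook: return a pair of IO streams over which process
# `pid` is reached. This is where an InfiniBand-backed stream would be created
# instead of the default TCPSocket the ssh manager uses.
function Distributed.connect(m::IBManager, pid::Int, config::WorkerConfig)
    error("custom InfiniBand transport not implemented in this sketch")
end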