Setup topology with ClusterManagers

How does one pass a topology to a cluster manager? The Julia docs describe how to set up the topology when addprocs is called with a list of machines, but not when it is called with a cluster manager.

I am trying to do the following:

  • I would like to create a job (with multiple processes) for each of the computing nodes I require via addprocs(manager).
  • For each of the jobs, I would like to have a master with a bunch of workers.
  • The parent process that ran addprocs should connect only to the master process on each of the nodes.

In essence, I want to create a “chain of command”: the parent process would give instructions to the master process on each of the nodes; these masters would give instructions to their respective workers.

The reason I am trying to do this is that I need to share data structures between the workers on the same computing node.


With addprocs(manager) you will create one master and add n workers. For a master-worker topology, use the topology kwarg. For example, with SlurmManager, something like:

addprocs(SlurmManager(n_workers), N=m_nodes, topology=:master_worker, exeflags="--project=.")

I don’t know if it’s currently possible to do a “chain of command” where you have the parent process spawn other “parent processes” that then spawn workers.

@affans, it seems that currently it is impossible to create workers from other workers.

On the other hand, I would simply like to change the topology. The master-worker topology implies that all the workers connect only to the master process which ran addprocs. I would like only one worker to be connected to the master process, while the other workers connect to that worker.

I have looked at the docs, and it seems I don’t understand at all how one changes the topology.
Can anyone point me to a guide that discusses in detail how to set up a custom topology? Some good examples?

I don’t think that’s currently possible with the available topologies. What you could do is set up a custom topology… from the documentation:

The launch method of the cluster manager specifies the connection topology via fields ident and connect_idents in WorkerConfig. A worker with a cluster manager identity ident will connect to all workers specified in connect_idents.

See Distributed Computing · The Julia Language. You’ll probably have to read the source code for how the existing topologies are set up to properly understand how to implement a custom one.
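To make the quoted passage concrete, here is a minimal sketch of how the ident and connect_idents fields might be filled in for a “one lead worker per node” layout. The helper name node_local_configs and the numbering scheme are hypothetical and only for illustration. A real launch method would also have to start each worker process, fill in the other WorkerConfig fields (such as io) and push each config onto the launched array, and addprocs would need topology=:custom. As far as I understand, the process that calls addprocs still keeps a connection to every worker, so this only restricts worker-to-worker links.

```julia
using Distributed

# Hypothetical helper: build one WorkerConfig per worker for `nnodes` nodes
# with `per_node` workers each. The first worker on a node acts as the lead;
# every other worker on that node asks to connect to the lead only.
function node_local_configs(nnodes::Int, per_node::Int)
    configs = WorkerConfig[]
    for node in 1:nnodes, slot in 1:per_node
        wc = WorkerConfig()
        lead = (node - 1) * per_node + 1       # ident of this node's lead worker
        wc.ident = lead + slot - 1             # identity assigned by the cluster manager
        wc.connect_idents = slot == 1 ? Int[] : [lead]
        push!(configs, wc)
    end
    return configs
end

# Print the resulting layout for 2 nodes with 3 workers each.
for wc in node_local_configs(2, 3)
    println("ident ", wc.ident, " connects to ", wc.connect_idents)
end
```

The configs are built without launching anything, so the layout can be inspected before wiring it into a cluster manager.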

Also, do I understand correctly that ClusterManagers imply that workers talk via ssh or TCP/IP connection?
Is it possible to force the workers from the same computing node to talk directly to each other? (Maybe via some clever specification of topology?)

Yes.

Is it possible to force the workers from the same computing node to talk directly to each other? (Maybe via some clever specification of topology?)

Again, probably some custom topology, but this is outside my expertise. I often just use the master_worker topology.

That may be quite problematic. I had hoped someone had run into this problem and written a how-to guide…

all_to_all with lazy=true would work if the workers on each node only communicate with one another - see here and here.
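A small local sketch of that suggestion, assuming two workers started on the local machine stand in for workers that share a node. The topology and lazy keywords are the documented addprocs options; I believe lazy=true is already the default in recent Julia versions, so it is spelled out here only for clarity.

```julia
using Distributed

# Two local workers stand in for workers that share a node.
pids = addprocs(2; topology=:all_to_all, lazy=true)

# Ask the first worker to call the second one directly: this runs
# remotecall_fetch(myid, pids[2]) on pids[1]. With lazy=true the
# worker-to-worker connection is only set up when such a call first happens.
reply = remotecall_fetch(remotecall_fetch, pids[1], myid, pids[2])

println(reply == pids[2])

rmprocs(pids)
```

Pairs of workers that never call each other should never open a connection, which is what keeps this cheap when communication stays within a node.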

What type of connections does it use by default? Will it use different types of connection if two workers are on the same node or if they are on two different nodes?

Also, can someone clarify what the available types of connections between workers are?
The docs mention ssh and TCP/IP. Do I understand correctly that these types are used when we have two physically separate nodes connected either via InfiniBand or via the internet?
What kind of connection is used when the two workers are on the same node?