How is data routed between distributed processes?

Say there are two machines, the local L and remote R. I start my “director” julia process (myid() == 1) on machine L. Then the director process (L.1) spawns two worker processes R.2, R.3 on R with addprocs([("R", 2)]). Because the cluster topology is (by default) lazy all_to_all, L.1 can request R.2 to request R.3’s id with@fetchfrom 2 @fetchfrom 3 myid(). Now, my question is what’s the path that data takes to get back to the director?

I would want/naively expect these processes to take advantage of being on the same machine. In other words, the communication R.3 → R.2 stays within R much like how communication between local processes happens in the same machine. But my gut says that’s not how that works because the remote processes “don’t know” they’re on the same machine (unless they do?).

So my next guess would be that all data is communicated between workers e.g. R.2, R.3 by first going through the director e.g. L.1 since “it knows where everything is” as in a star network. But that also significantly increases the amount of network traffic and should tank performance if these remote procedure calls (RPCs) are being made frequently.

But, why have this setup to begin with? You should just use one process and multiple threads if it’s on the same machine.” I hear you say. The original idea was to have a tree topology that each machine has its own “middle manager” process which distributes work given by the “director” to “(true) workers”, each of which has its own GPU. Then, once calculations are done on the GPU, each true-worker process then sends its data to its manager, which then collects all the data and sends it back up to the director. This was done under the first “topological model” mentioned above to reduce network communication and improve performance, but now I’m not so sure that’s the case, and after “cutting out middle management” lead to significant performance increases (who woulda thunk?), I feel my suspicions are correct.

Whenever I get the chance to rewrite from scratch, multiple threads will be used on each remote process for multiple GPUs.

So overall, I guess it’s more of a general question on the effects of having a cluster topology that’s not just a star graph.