How is data routed between distributed processes?

NonDairyNeutrino · May 17, 2025, 3:59pm

Say there are two machines, the local L and remote R. I start my “director” julia process (myid() == 1) on machine L. Then the director process (L.1) spawns two worker processes R.2, R.3 on R with addprocs([("R", 2)]). Because the cluster topology is (by default) lazy all_to_all, L.1 can request R.2 to request R.3’s id with@fetchfrom 2 @fetchfrom 3 myid(). Now, my question is what’s the path that data takes to get back to the director?

I would want/naively expect these processes to take advantage of being on the same machine. In other words, the communication R.3 → R.2 stays within R much like how communication between local processes happens in the same machine. But my gut says that’s not how that works because the remote processes “don’t know” they’re on the same machine (unless they do?).

So my next guess would be that all data is communicated between workers e.g. R.2, R.3 by first going through the director e.g. L.1 since “it knows where everything is” as in a star network. But that also significantly increases the amount of network traffic and should tank performance if these remote procedure calls (RPCs) are being made frequently.

“But, why have this setup to begin with? You should just use one process and multiple threads if it’s on the same machine.” I hear you say. The original idea was to have a tree topology that each machine has its own “middle manager” process which distributes work given by the “director” to “(true) workers”, each of which has its own GPU. Then, once calculations are done on the GPU, each true-worker process then sends its data to its manager, which then collects all the data and sends it back up to the director. This was done under the first “topological model” mentioned above to reduce network communication and improve performance, but now I’m not so sure that’s the case, and after “cutting out middle management” lead to significant performance increases (who woulda thunk?), I feel my suspicions are correct.

Whenever I get the chance to rewrite from scratch, multiple threads will be used on each remote process for multiple GPUs.

So overall, I guess it’s more of a general question on the effects of having a cluster topology that’s not just a star graph.

Topic		Replies	Views
Data exchange between Julia sessions (same or different machines) Julia at Scale question , distributed	10	1190	February 24, 2023
Two level distributed / parallel execution Julia at Scale question	4	1074	April 22, 2020
The ultimate guide to distributed computing Julia at Scale parallel , cluster , distributed	44	9864	June 21, 2021
What's under the hood of Julia's remote call procedure? Internals & Design distributed , rpc , remote	2	303	April 2, 2025
Trivial question about workers on a cluster Performance question , package , parallel	1	374	April 3, 2021

How is data routed between distributed processes?

Related topics