I have a small collection of tree-based collectives that I’m using for some of my projects. I’m new to distributed algorithms and wanted to get some thoughts on the code I’ve written and the profiling I’ve done in hopes of writing something that’d be useful to others. I’ve been working on parallel graph algorithms using distributed SPDM programs. I had some issues running MPI with our servers and since sought a pure julia solution.
I started working with DistributedArray’s SPDM module, but wanted something with tree based collectives. I’ve seen others mentioning Distributed.jl didn’t have a tree-based reduce and tried my hand at an some methods that emulate
receive by putting and taking from RemoteChannels. Channels are precomputed on the spawning node (pid = 1) and passed to the processors when @spawn is called. Then each process uses the pre-computed channels (passed in as arguments) to send messages during the program’s execution.
The code currently includes:
Exclusive Prefix Scan (Blelloch Scan)
each method, has a standard and profiling implementation and correctness testing. The algorithms implemented are tree based (hypercube) collectives from section 4 of Introduction to Parallel Computing (2nd edit.) and discussions with Prof. Ananth Grama. All methods work for any number of processors rather than clusters with just 2^k nodes, and RemoteChannels are parameterized by message type.
I have digests of runtimes of the available collectives on our shared memory systems. I’ve tested on a single NUMA node (with ClusterManager.jl’s LocalAffinityManager and numactl) and across 4 of our servers using the SSHManager, each with 32 processes. I haven’t tested against other methods currently because I’m still determining the best server environment to test these methods with. As a reference point I’ve included a naive for loop spawning and fetching/sending data from the workers sequentially.
Currently these plots serve to inform users of a conservative lower bound of how much serial work each processor needs to be done to make the communication worth it. They report the total communication building time + the fetch_spawn time from the perspective of the spawning node. The actual communication time of the workers will be shorter than that of the spawning node.
Local NUMA Node experiments:
- Utilize a Intel Xeon CPU E5 2690 v4 @ 2.6Ghz processor which has 14 physical cores (each emulating 2 hyperthreaded cores).
- number of processes: p = 2,4,8,16,27
- seeded square matrix message sizes: n = 1,10,100,1000,10k
- (larger than p = 16, n = 10k) Personalized All-to-All naive experiments trigger errors on our servers and thus n=10^4 results are omitted.
Remote Experiments are connected from a fifth server and with the SSHManager
- 32 processes per 4 servers of…
- Xeon E5-2667,
- Xeon E7-8867 v3,
- Xeon E5-2670
- and (the aforementioned) Xeon E5-2690 v4 processors.
- The servers have different number of processors, but the smallest runs a single Xeon E5-2670 CPU, which has 32 hyperthreaded cores.
- 32 processes per 4 servers of…
- number of processes: p = 32,64,96,128
- seeded square matrix message sizes: n = 1,10,100,1000
- Personalized All-to-All (PA2A) are ommitted due to some socket issues I’m having running the full experiment drivers. The methods run for individual parameters, but the full experiment triggers socket errors. I plan to fix the plotting code to include the RCC.jl PA2A profiling soon.
Ive included a naive for loop with my results because serialized communication won’t trigger NUMA Node’s known congestion issues (section 1.5.3, B. Lepers’ Thesis). It seems I’m still seeing bandwidth issues across the TCP connection that Distributed.jl is setting up.
Currently the gather implementation seems to have the worst performance.
Process level performance profiling shows that the processors further up the tree are spending significantly more time sending their (larger) messages, and the runtime is longer than having the spawning process fetch messages generated by the workers. These figures are reporting the same data as my previous plots, but also include the runtimes of each of the workers. I have these figures for all of the collective operations, but didn’t want to inundate the (already lengthy) post.
The code hasn’t been heavily optimized beyond ensuring type stability and I have an ever growing list of how I want to improve the code. So far I’ve sought to produce basic profiling and testing components with minimal dependencies and will try to improve from here. I’d like to see if anyone has any resources/suggestions on how I can improve the current approach and my testing methodology.
On my to do list is to …
- Finish scatter
- Make my plotting code + experiment data available for others to use/scrutinize.
- Improve gather (method relies on amortized append! rather than preallocation).
- Eliminate the log(p) factor in some method’s isoefficiency.
- improve documentation
- Construct RemoteChannels concurrently on workers.
- Incorporate SharedArrays
- Create NUMA aware strategies
- Improve power of two batch reduction methods