Productive Scalable Distributed Task Scheduling Using an MPI-based Backend for Dagger

Hello Julia Community, especially Dagger and HPC developers. I hope you are doing well this summer and have achieved your goals! I’m here to share:

  • The status of Dagger’s MPI Implementation
  • The conclusion of my Google Summer of Code with Julia
  • Some data visualizations of Dagger’s MPI Implementation

I would like to express my heartfelt gratitude for the incredible mentorship and support that @fda-tome and @jpsamaroo provided me during this summer. Their support enabled me to work at MIT CSAIL in the Julia Lab, which was a wonderful experience. I also had the pleasure of meeting @evelyne-ringoot and @rayegun, who welcomed me warmly. I truly appreciated the opportunity to work with the Julia language.

A Few Things I Forgot to Mention

In my previous update Faster MPI Integration in Dagger, I mentioned the core functionality of our MPI implementation. I want to add that the communication layer is built on nonblocking MPI with cooperative progress. This allows asynchronous sends and receives to be polled while yielding, which naturally overlaps computation with communication and lets other Julia tasks run simultaneously.
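
For readers who want to see this pattern concretely, here is a minimal sketch of cooperative progress with MPI.jl. It illustrates the general technique only, not Dagger’s actual communication layer: a nonblocking request is polled with `MPI.Test` while yielding to the Julia scheduler, and the `cooperative_wait` helper is just a name for this sketch.

```julia
# Minimal sketch of the cooperative-progress pattern (not Dagger's actual code):
# nonblocking Isend/Irecv! requests are polled with MPI.Test while yielding to
# the Julia scheduler, so other Julia tasks keep running in the meantime.
using MPI

# Poll a request until completion, yielding instead of blocking.
function cooperative_wait(req)
    while !MPI.Test(req)
        yield()
    end
end

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

if rank == 0
    buf = fill(1.0f0, 1024)
    cooperative_wait(MPI.Isend(buf, comm; dest=1, tag=0))
elseif rank == 1
    buf = Array{Float32}(undef, 1024)
    cooperative_wait(MPI.Irecv!(buf, comm; source=0, tag=0))
end

MPI.Finalize()
```

Run with at least two ranks, e.g. `mpiexecjl -n 2 julia script.jl`.
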
The ultimate goal of this project is to extend Dagger’s capabilities to HPC clusters with low-latency interconnects, as its default TCP communication is a limiting factor in these environments. The core of our solution is Dagger’s MPI Implementation, which replaces the TCP transport with MPI-aware task placement and data movement.

Benchmark setup

The performance of Dagger’s MPI Implementation was evaluated against Dagger’s TCP backend using Cholesky decomposition as a parallel benchmark. The experiments were conducted on a high-memory cloud instance (an AWS r5d.24xlarge with 96 vCPUs and 768 GiB of RAM). A rough sketch of the benchmark kernel is shown below.
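
The following is only a rough sketch of what such a kernel looks like, not the exact script used for the runs: the tile size is an assumption, and it assumes the distributed `cholesky!` method provided by Dagger’s linear algebra support for `DArray`s.

```julia
# Rough sketch of the benchmark kernel (not the exact script used for the runs):
# tiled Cholesky of a Dagger DArray, timed and converted to GFLOPS.
using Dagger, LinearAlgebra

N  = 18_000   # matrix size from the strong-scaling runs (shrink for a quick local test)
nb = 2_000    # tile size; an assumption, the actual blocking may differ

A  = rand(Float32, N, N)
A  = (A + A') / 2 + N * I                     # exactly symmetric and positive-definite
DA = distribute(A, Dagger.Blocks(nb, nb))     # split into nb x nb tiles

# Assumes Dagger's distributed cholesky! for DArray (recent Dagger versions).
t = @elapsed cholesky!(DA)
gflops = (N^3 / 3) / t / 1e9                  # Cholesky costs ~N^3/3 flops
println("Cholesky: $(round(t; digits=2)) s, $(round(gflops; digits=1)) GFLOPS")
```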

Evaluation

Two types of scaling experiments were performed to evaluate the system’s performance:

  • Strong scaling measured the performance of a fixed 18000 x 18000 matrix workload distributed across an increasing number of processes, testing the framework’s ability to divide a constant workload among more cores.
  • Weak scaling measured performance with a workload that grew proportionally to the number of processes (from 1 to 81), showing how performance evolves as the workload and the computing capacity expand together; the standard metrics used to read these results are sketched after this list.
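
As a reference for reading the plots, here is a small sketch of the standard scaling metrics. The flop count assumes the usual N^3/3 cost of Cholesky, and the timings below are placeholders, not measured values.

```julia
# Standard scaling metrics (placeholder numbers, not measured data).
cholesky_flops(N) = N^3 / 3                      # approximate flop count of Cholesky

gflops(N, t)                 = cholesky_flops(N) / t / 1e9
strong_efficiency(t1, tp, p) = t1 / (p * tp)     # fixed problem size, p processes
weak_efficiency(t1, tp)      = t1 / tp           # problem size grows with p

# Example with hypothetical timings for the 18000 x 18000 strong-scaling case:
N, t1, t25 = 18_000, 300.0, 20.0
@show gflops(N, t25)
@show strong_efficiency(t1, t25, 25)
```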

Figure 1: Strong and weak scaling execution times versus the number of processors with MPI and TCP backends using Float32 data type.

Figure 2: Strong and weak scaling GFLOPS versus the number of processors with MPI and TCP backends using Float32 data type.

Our evaluation, summarized in Figures 1 and 2, provides an initial view of the performance trade-offs between the MPI and TCP backends. In the execution-time results (Figure 1), MPI achieves lower runtimes at small to medium process counts but degrades as communication overheads grow, while TCP remains nearly constant, reflecting its lack of distributed scaling. In the strong-scaling throughput results (Figure 2), MPI sustains higher throughput at 2 processes, but its efficiency drops rapidly as the process count grows, while TCP remains more stable. These initial results nonetheless demonstrate Dagger’s ability to leverage HPC interconnects for parallel workloads.

Next Steps and How to Contribute

We are still working to optimize the MPI implementation and unlock its full potential. Unfortunately, we couldn’t complete the RMA implementation for broadcasting, as we focused on establishing a solid benchmark setup to gather data for our ACM SC25 poster. We are eagerly awaiting the results!

Moving forward, I’ll continue to work on the RMA implementation and profile our MPI code to identify and reduce communication overhead.
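
For context on the direction, here is a minimal MPI.jl one-sided (RMA) sketch of the general idea: the root exposes a buffer through a window and the other ranks read it inside a fence epoch. This only illustrates the technique with standard MPI.jl calls; it is not the planned Dagger broadcast implementation.

```julia
# Minimal one-sided (RMA) "broadcast" sketch: every rank reads the root's
# buffer through an MPI window inside a fence epoch. Illustration only.
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
root = 0

buf = rank == root ? collect(Float32, 1:8) : zeros(Float32, 8)
win = MPI.Win_create(buf, comm)      # collectively expose each rank's buffer
out = copy(buf)

MPI.Win_fence(0, win)                # open the access epoch
if rank != root
    MPI.Get!(out, root, win)         # read the root's exposed buffer
end
MPI.Win_fence(0, win)                # close the epoch; all Gets are complete

@assert out == collect(Float32, 1:8)
MPI.free(win)
MPI.Finalize()
```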

If you’d like to contribute, you can find the implementation in the yg/faster-mpi branch. We are cleaning up the code to get it ready for merging. Feel free to dive into the Dagger documentation and explore our MPI code, especially the point-to-point and broadcast communication implementations. Any issues or pull requests are welcome on the Dagger repository on GitHub.


I cannot see this in the figure. In the left figure, TCP reaches about 40 GFLOPS with 25 processors, while MPI is close to zero GFLOPS.

Perhaps there is a mistake in the figures?


Thanks for pointing that out! You’re right, in the weak scaling plot, TCP clearly achieves higher GFLOPS than MPI. My statement was referring specifically to the initial part of the strong scaling results (with 2 processors), where MPI achieves higher throughput before communication overheads start to dominate. I’ll make that distinction clearer in the text to avoid confusion.
