How to profile Julia MPI code?

How do people usually profile Julia MPI code to find bottlenecks and reduce communication overhead?

As an MPI noob I thought there would be some profiler tools out there (kinda like cuprof and NVIDIA Nsight), but they seem to be mostly proprietary (e.g. TotalView, Arm MAP) and probably overkill for our use case, since we're not running with thousands of ranks yet.

Looks like there may be some open-source, language-independent tools like HPCToolkit (http://www.hpctoolkit.org/) and mpiP (https://github.com/LLNL/mpiP), a light-weight MPI profiler, which I might try looking into.

EDIT: Right now just trying to profile CPU MPI code, but definitely looking to profile GPU/CUDA-aware MPI code in the near future.


I would love to see graphs like the ones in this paper by Parry Husbands and Kathy Yelick: https://upc.lbl.gov/publications/husbands-lu-sc07.pdf (Figures 4 and 5). I imagine one could easily instrument the MPI calls, add timers, and log it all; from that one could produce various nice visualizations, like we do with our profilers. I think they wrote some custom tooling for the charts in that paper.
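A minimal hand-rolled version of that instrumentation could look something like the sketch below (the `timed` helper, the tags, and the file names are made up for illustration, not the paper's tooling):

```julia
using MPI, Printf

# Record a (rank, tag, start, stop) tuple around each communication call;
# the per-rank logs can be post-processed into timeline/heatmap plots.
const LOG = Tuple{Int,String,Float64,Float64}[]

function timed(f, rank, tag)
    t0 = MPI.Wtime()
    result = f()
    push!(LOG, (rank, tag, t0, MPI.Wtime()))
    return result
end

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

buf = zeros(Float64, 10^6)
timed(rank, "allreduce") do
    MPI.Allreduce!(buf, +, comm)
end

# One log file per rank; combine them afterwards for visualization.
open("mpilog.$rank.csv", "w") do io
    for (r, tag, t0, t1) in LOG
        @printf(io, "%d,%s,%.9f,%.9f\n", r, tag, t0, t1)
    end
end
MPI.Finalize()
```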

We could do similar tricks as libblastrampoline to make this all nice and easy for MPI, but @staticfloat says ABI issues in MPI prevent this.

I think Dagger.jl has some of this tooling now. I am sure I have seen pictures of such MPI communication visualizers at conferences, which makes me think there are products that do this. I also found this Stack Overflow thread: "Visualizing communication pattern of MPI processes".

-viral


If you have relatively simple communication patterns and a small number of ranks, you can get decent results just by profiling a single rank. You can slap a TimerOutput on your MPI calls to get a nice summary of whether communication is the limiting factor. If you have load-balancing issues, you can also time the operations on each rank and compute, e.g., standard deviations across ranks. Of course this doesn't replace a proper parallel profiler for large-scale applications, but for smaller setups it's a good quick-and-dirty solution.
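For example, a rough sketch of that approach (the labels and the `Allreduce!` stand in for whatever your code actually does):

```julia
using MPI, TimerOutputs

MPI.Init()
const to = TimerOutput()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)
buf = fill(Float64(rank), 10^6)

for iter in 1:10
    @timeit to "compute"   sum(abs2, buf)               # stand-in for real work
    @timeit to "allreduce" MPI.Allreduce!(buf, +, comm)
end

# Print each rank's report in turn to eyeball load imbalance
# (or gather the timings and compute statistics across ranks instead).
for r in 0:nranks-1
    r == rank && print_timer(to)
    MPI.Barrier(comm)
end
MPI.Finalize()
```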


Thanks to @lyonsquark, v0.17 of MPI.jl (which I just tagged) should now support MPI profilers which use LD_PRELOAD hooks. I believe he has tested it with Darshan.
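I haven't run Darshan myself, but I'd expect the LD_PRELOAD route to look roughly like this (the script name and the profiler library path are placeholders; you can equally just export LD_PRELOAD in your shell before calling mpiexec):

```julia
# Launch the MPI job with the profiler's shared library preloaded into
# every rank; addenv keeps the rest of the environment intact.
cmd = addenv(`mpiexec -n 4 julia --project my_mpi_script.jl`,
             "LD_PRELOAD" => "/path/to/libdarshan.so")
run(cmd)
```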

I haven't tried it, but I believe you should also be able to use NVIDIA Nsight Systems to profile MPI even if you're not using CUDA: just specify the --trace=mpi option (you will also need to specify the MPI implementation via the --mpi-impl option).
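If that works, I'd expect the invocation to be something along these lines (the launcher, per-rank output naming, and script name are assumptions about your setup):

```julia
# Trace MPI calls with Nsight Systems, writing one report per rank;
# %q{...} is expanded by nsys from the named environment variable.
run(`mpiexec -n 4 nsys profile --trace=mpi --mpi-impl=openmpi --output=report_rank%q{OMPI_COMM_WORLD_RANK} julia my_mpi_script.jl`)
```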

I’d be keen to hear how people get on with various MPI profilers: if you do have problems (or find solutions to problems), please chime in here or open an issue.


We could do similar tricks as libblastrampoline to make this all nice and easy for MPI, but @staticfloat says ABI issues in MPI prevent this.

I haven’t looked into it too deeply, but I think an MPI demuxing library is definitely doable. It wouldn’t be the same as libblastrampoline because the ABI is actually different between the different vendors (as opposed to BLAS where the different vendors tend to differ in only naming and/or ILP64-ness), so you’d need to do a bit of argument translation. I haven’t looked into how you would autodetect the ABI, but I’m sure it’s possible. It’s not high on my TODO list, but I am confident we can do something similar if we really want to.


This is the logic MPI.jl currently uses: MPI.jl/implementations.jl at 078723f8a052c7af863e0e70cf6dc3007fdc5d65 · JuliaParallel/MPI.jl · GitHub

I think it would make sense to limit support to the Open MPI, MPICH, and Microsoft MPI ABIs, which would cover almost all current MPI implementations (other than >5-year-old MPICH derivatives). The main challenge is how to define handles: they can be either 32-bit integers or pointers.
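To make the handle problem concrete, here's a rough sketch (the ABI constant and the type definitions are made up for illustration, not MPI.jl's actual internals):

```julia
# Assumption for this sketch: some detection step has told us which ABI
# the loaded libmpi follows.
const ABI = :MPICH

if ABI == :MPICH
    # MPICH-derived ABIs (MPICH, Intel MPI, MVAPICH, ...) use 32-bit
    # integer handles with fixed, compile-time-known values.
    const MPI_Comm = Cint
    const MPI_COMM_WORLD = MPI_Comm(0x44000000)
else
    # Open MPI handles are pointers to opaque structs; MPI_COMM_WORLD is
    # the address of a global object inside libmpi, so it has to be looked
    # up at runtime (e.g. via dlsym) rather than hard-coded.
    const MPI_Comm = Ptr{Cvoid}
end
```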

I can't say this is what I did myself, but we got NAG involved in profiling our (CPU) MPI Julia code (for which we are very grateful!). They used Extrae to profile it, and provided some extremely clear results showing in great detail where the outstanding problems were. We had to get advice on Slack about how to instrument the libraries correctly, but it all worked out relatively(!) easily in the end (and maybe these recent fixes have solved those problems), except that we didn't seem to be able to go beyond 127 MPI processes for some as-yet-unresolved reason…

Thanks for the helpful replies everyone! We'll try @simonbyrne's suggestion of using NVIDIA Nsight, since it might let us kill two birds with one stone (GPU profiling and MPI profiling), and we'll post back!