How to profile Julia MPI code?

How do people usually profile Julia MPI code to find bottlenecks and reduce communication overhead?

As an MPI noob I thought there would be some profiler tools out there (kinda like nvprof and NVIDIA Nsight), but they seem to be mostly proprietary (e.g. TotalView, Arm MAP) and probably overkill for our use case since we’re not running with thousands of ranks yet.

Looks like there may be some open-source language-independent tools like http://www.hpctoolkit.org/ and https://github.com/LLNL/mpiP which I might try looking into.

EDIT: Right now just trying to profile CPU MPI code, but definitely looking to profile GPU/CUDA-aware MPI code in the near future.

2 Likes

I would love to see graphs like Figures 4 and 5 in this paper by Parry Husbands and Kathy Yelick: https://upc.lbl.gov/publications/husbands-lu-sc07.pdf. I imagine one could easily instrument the MPI calls, add timers, and then log it all - from that one could produce various nice visualizations, like we do with our profilers. I think they wrote some custom tooling for the charts in that paper.
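For the record, here is a rough sketch of what I mean by “instrument and log” (the wrapper name, the call names, and the merging step are made up, not an existing package): each rank records `(name, t_start, t_end)` tuples for the MPI calls it wraps and dumps them to a per-rank file, which a separate script could then merge into a timeline plot.

```julia
using MPI

const EVENTS = Tuple{String,Float64,Float64}[]

# wrap an MPI call and record its name together with start/end timestamps
function logged(f, name, args...)
    t0 = MPI.Wtime()
    result = f(args...)
    push!(EVENTS, (name, t0, MPI.Wtime()))
    return result
end

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

buf = rand(1000)
logged(MPI.Allreduce, "Allreduce", buf, +, comm)
logged(MPI.Barrier, "Barrier", comm)

# one log file per rank; merge and plot them offline
open("events_rank$(rank).txt", "w") do io
    foreach(e -> println(io, join(e, '\t')), EVENTS)
end
MPI.Finalize()
```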

We could do tricks similar to libblastrampoline to make this all nice and easy for MPI, but @staticfloat says ABI issues in MPI prevent this.

I think Dagger.jl has some of this tooling now. I am sure I have seen pictures of such MPI communication visualizers at conferences, which makes me feel like there are products that do this; I also found this SO thread: openmpi - Visualizing communication pattern of MPI processes - Stack Overflow

-viral

2 Likes

If you have relatively simple patterns and a small number of ranks, you can get decent results just by profiling a single rank. You can slap a TimerOutput on your MPI calls to get a nice summary of whether communication is the limiting factor. If you have load-balancing issues, you can also time the operations on each rank and compute e.g. standard deviations across ranks. Of course that doesn’t replace a proper parallel profiler for large-scale applications, but for smaller setups it’s a good quick-and-dirty solution.
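For concreteness, a minimal sketch of what I mean (the loop body and variable names are just placeholders): wrap the communication and the local work in `@timeit` blocks and compare their shares in the printed table.

```julia
using MPI, TimerOutputs

MPI.Init()
const to = TimerOutput()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

x = fill(Float64(rank), 1_000_000)
for _ in 1:10
    @timeit to "compute"   x .= x .* 1.0001                 # local work
    @timeit to "Allreduce" MPI.Allreduce(sum(x), +, comm)   # communication
end

rank == 0 && print_timer(to)   # or print/compare per-rank timers to spot imbalance
MPI.Finalize()
```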

1 Like

Thanks to @lyonsquark, v0.17 of MPI.jl (which I just tagged) should now support MPI profilers which use LD_PRELOAD hooks. I believe he has tested it with Darshan.

I haven’t tried it, but I believe you should also be able to use NVIDIA Nsight Systems to profile MPI even if you’re not using CUDA: just specify the --trace=mpi option (you will also need to specify the MPI implementation via the --mpi-impl option).
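I haven’t verified this, but based on the nsys documentation an invocation along these lines should work (treat it as a sketch: the --mpi-impl value and the rank environment variable assume Open MPI, so adjust them to your MPI):

```sh
# one nsys report per rank; use --mpi-impl=mpich for MPICH-based MPIs
mpiexec -n 4 nsys profile --trace=mpi --mpi-impl=openmpi \
    -o report_rank%q{OMPI_COMM_WORLD_RANK} julia myscript.jl
```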

I’d be keen to hear how people get on with various MPI profilers: if you do have problems (or find solutions to problems), please chime in here or open an issue.

5 Likes

We could do tricks similar to libblastrampoline to make this all nice and easy for MPI, but @staticfloat says ABI issues in MPI prevent this.

I haven’t looked into it too deeply, but I think an MPI demuxing library is definitely doable. It wouldn’t be the same as libblastrampoline because the ABI is actually different between the different vendors (as opposed to BLAS where the different vendors tend to differ in only naming and/or ILP64-ness), so you’d need to do a bit of argument translation. I haven’t looked into how you would autodetect the ABI, but I’m sure it’s possible. It’s not high on my TODO list, but I am confident we can do something similar if we really want to.

2 Likes

This is the logic MPI.jl currently uses: https://github.com/JuliaParallel/MPI.jl/blob/078723f8a052c7af863e0e70cf6dc3007fdc5d65/src/implementations.jl#L80-L147

I think it would make sense to limit support to the Open MPI, MPICH, and Microsoft MPI ABIs, which would cover almost all current MPI implementations (other than >5-year-old MPICH derivatives). The main challenge is how to define handles: depending on the ABI they can be either 32-bit integers or pointers.
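To illustrate the handle issue (this is just a sketch of the underlying C ABIs, not actual MPI.jl code): under the MPICH ABI an `MPI_Comm` is a plain 32-bit integer with fixed constant values, while under Open MPI it is a pointer to an opaque struct resolved at load time, so the Julia-side handle type has to differ accordingly.

```julia
# MPICH ABI: handles are 32-bit integers with fixed constants
const MPICH_MPI_Comm = Cint          # e.g. MPI_COMM_WORLD == 0x44000000

# Open MPI ABI: handles are pointers to opaque structs
const OpenMPI_MPI_Comm = Ptr{Cvoid}  # e.g. MPI_COMM_WORLD == &ompi_mpi_comm_world
```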

I can’t claim I did this myself, but we got NAG involved in profiling our (CPU) MPI Julia code (for which we are very grateful!). They used Extrae to profile it, and it produced some extremely clear results showing in great detail where the outstanding problems were. We had to get advice on Slack about how to instrument the libraries correctly, but it all worked out relatively(!) easily in the end (and maybe these recent fixes have solved those problems), except that we didn’t seem to be able to go beyond 127 MPI processes for some as-yet-unresolved reason…

Thanks for the helpful replies everyone! We’ll try @simonbyrne’s suggestion of using NVIDIA Nsight since it might allow us to kill two birds with one stone (GPU profiling and MPI profiling) and post back!

@PolarizedPoutine Have you had success with using nsys? I’m currently playing around with it for regular non-MPI CPU profiling, but so far with limited success.

Update: Two primitive examples for using NVIDIA Nsight Systems to profile Julia MPI code: https://github.com/carstenbauer/JuliaHLRS22/tree/main/backup/MPI%20profiling%20(nsys)

1 Like

And since I’m necroposting anyway, here is an experimental (but working) attempt to profile Julia MPI code with Score-P.jl.

(Would be great to start similar efforts for Extrae, HPCToolkit, TAU, etc.)

1 Like

Are get_arguments and integration Julia functions?

No, unfortunately not. They are manually assigned names of NVTX ranges (see, e.g., here).

1 Like

Maybe @mofeing would be interested in this? :eyes:

1 Like

Yes! Actually, we have already started some efforts in Extrae.jl, and you can find us discussing it in Slack’s #extrae channel.

We managed to trace some toy examples using Distributed, but we are crashing on some more complex examples, possibly due to our hacks :sweat_smile:.

We would appreciate help from a Julia internals expert.

Hey! Yes, I was going to reply with this as well. We are focusing on instrumenting the Distributed model for the moment, and then we’ll tackle the Threads part.

As for your Figures 4 and 5: it still needs some visual fine-tuning, but my (draft-stage) MPITape.jl can produce this:

(Unfortunately?) It is based on Cassette.jl and might come with some overhead, though. It would be great to have an efficient, non-Cassette-based tracing API in Julia. I think some people are working on something in this direction, if I’m not mistaken (see profiling on Slack).
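In case it helps anyone, here is the general flavor of the Cassette-based approach (this is not MPITape.jl’s actual implementation, just a toy sketch): define a context whose `prehook` logs every call into the MPI module before it runs, then `overdub` your program with that context.

```julia
using Cassette, MPI

Cassette.@context TraceCtx

# print a timestamp and the function name for every call into the MPI module
function Cassette.prehook(::TraceCtx, f, args...)
    f isa Function && parentmodule(f) === MPI && println(stderr, time(), "  ", f)
    return nothing
end

function main()
    MPI.Init()
    comm = MPI.COMM_WORLD
    MPI.Barrier(comm)
    MPI.Allreduce(1.0, +, comm)
    MPI.Finalize()
end

Cassette.overdub(TraceCtx(), main)   # run main() with every nested call instrumented
```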

2 Likes