How to profile Julia MPI code?

How do people usually profile Julia MPI code to find bottlenecks and reduce communication overhead?

As an MPI noob I thought there would be some profiler tools out there (kinda like cuprof and NVIDIA Nsight), but they seem to be mostly proprietary (e.g. TotalView, Arm MAP) and probably overkill for our use case, since we're not running with thousands of ranks yet.

Looks like there may be some open-source, language-independent tools like HPCToolkit (http://www.hpctoolkit.org/) and mpiP (https://github.com/LLNL/mpiP), a light-weight MPI profiler, which I might try looking into.

EDIT: Right now just trying to profile CPU MPI code, but definitely looking to profile GPU/CUDA-aware MPI code in the near future.


I would love to see graphs like the ones in this paper by Parry Husbands and Kathy Yelick: https://upc.lbl.gov/publications/husbands-lu-sc07.pdf (Figures 4 and 5). I imagine one could easily instrument the MPI calls, add timers, and log it all; from that one could produce various nice visualizations, like we do with our profilers. I think they wrote some custom tooling for the charts in that paper.
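A minimal hand-rolled version of that instrumentation could look something like the sketch below (the `timed` helper, the tags, and the file names are made up for illustration, not the paper's tooling):

```julia
using MPI, Printf

# Record a (rank, tag, start, stop) tuple around each communication call;
# the per-rank logs can be post-processed into timeline/heatmap plots.
const LOG = Tuple{Int,String,Float64,Float64}[]

function timed(f, rank, tag)
    t0 = MPI.Wtime()
    result = f()
    push!(LOG, (rank, tag, t0, MPI.Wtime()))
    return result
end

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

buf = zeros(Float64, 10^6)
timed(rank, "allreduce") do
    MPI.Allreduce!(buf, +, comm)
end

# One log file per rank; combine them afterwards for visualization.
open("mpilog.$rank.csv", "w") do io
    for (r, tag, t0, t1) in LOG
        @printf(io, "%d,%s,%.9f,%.9f\n", r, tag, t0, t1)
    end
end
MPI.Finalize()
```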

We could do similar tricks as libblastrampoline to make this all nice and easy for MPI, but @staticfloat says ABI issues in MPI prevent this.

I think Dagger.jl has some of this tooling now. I am sure I have seen pictures of such MPI communication visualizers at conferences, which makes me think there are products that do this. I also found this Stack Overflow thread: "Visualizing communication pattern of MPI processes".

-viral


If you have relatively simple communication patterns and a small number of ranks, you can get decent results just by profiling a single rank. You can slap a TimerOutput on your MPI calls to get a nice summary of whether communication is the limiting factor. If you have load-balancing issues, you can also time the operations on each rank and compute, e.g., standard deviations across ranks. Of course this doesn't replace a proper parallel profiler for large-scale applications, but for smaller setups it's a good quick-and-dirty solution.
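For example, a rough sketch of that approach (the labels and the `Allreduce!` stand in for whatever your code actually does):

```julia
using MPI, TimerOutputs

MPI.Init()
const to = TimerOutput()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)
buf = fill(Float64(rank), 10^6)

for iter in 1:10
    @timeit to "compute"   sum(abs2, buf)               # stand-in for real work
    @timeit to "allreduce" MPI.Allreduce!(buf, +, comm)
end

# Print each rank's report in turn to eyeball load imbalance
# (or gather the timings and compute statistics across ranks instead).
for r in 0:nranks-1
    r == rank && print_timer(to)
    MPI.Barrier(comm)
end
MPI.Finalize()
```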


Thanks to @lyonsquark, v0.17 of MPI.jl (which I just tagged) should now support MPI profilers which use LD_PRELOAD hooks. I believe he has tested it with Darshan.
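I haven't run Darshan myself, but I'd expect the LD_PRELOAD route to look roughly like this (the script name and the profiler library path are placeholders; you can equally just export LD_PRELOAD in your shell before calling mpiexec):

```julia
# Launch the MPI job with the profiler's shared library preloaded into
# every rank; addenv keeps the rest of the environment intact.
cmd = addenv(`mpiexec -n 4 julia --project my_mpi_script.jl`,
             "LD_PRELOAD" => "/path/to/libdarshan.so")
run(cmd)
```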

I haven't tried it, but I believe you should also be able to use NVIDIA Nsight Systems to profile MPI even if you're not using CUDA: just specify the --trace=mpi option (you will also need to specify the MPI implementation via the --mpi-impl option).
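If that works, I'd expect the invocation to be something along these lines (the launcher, per-rank output naming, and script name are assumptions about your setup):

```julia
# Trace MPI calls with Nsight Systems, writing one report per rank;
# %q{...} is expanded by nsys from the named environment variable.
run(`mpiexec -n 4 nsys profile --trace=mpi --mpi-impl=openmpi --output=report_rank%q{OMPI_COMM_WORLD_RANK} julia my_mpi_script.jl`)
```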

I’d be keen to hear how people get on with various MPI profilers: if you do have problems (or find solutions to problems), please chime in here or open an issue.


We could do similar tricks as libblastrampoline to make this all nice and easy for MPI, but @staticfloat says ABI issues in MPI prevent this.

I haven’t looked into it too deeply, but I think an MPI demuxing library is definitely doable. It wouldn’t be the same as libblastrampoline because the ABI is actually different between the different vendors (as opposed to BLAS where the different vendors tend to differ in only naming and/or ILP64-ness), so you’d need to do a bit of argument translation. I haven’t looked into how you would autodetect the ABI, but I’m sure it’s possible. It’s not high on my TODO list, but I am confident we can do something similar if we really want to.


This is the logic MPI.jl currently uses: MPI.jl/implementations.jl at 078723f8a052c7af863e0e70cf6dc3007fdc5d65 · JuliaParallel/MPI.jl · GitHub

I think it would make sense to limit support to the Open MPI, MPICH, and Microsoft MPI ABIs, which would cover almost all current MPI implementations (other than >5-year-old MPICH derivatives). The main challenge is how to define handles: they can be either 32-bit integers or pointers.
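To make the handle problem concrete, here's a rough sketch (the ABI constant and the type definitions are made up for illustration, not MPI.jl's actual internals):

```julia
# Assumption for this sketch: some detection step has told us which ABI
# the loaded libmpi follows.
const ABI = :MPICH

if ABI == :MPICH
    # MPICH-derived ABIs (MPICH, Intel MPI, MVAPICH, ...) use 32-bit
    # integer handles with fixed, compile-time-known values.
    const MPI_Comm = Cint
    const MPI_COMM_WORLD = MPI_Comm(0x44000000)
else
    # Open MPI handles are pointers to opaque structs; MPI_COMM_WORLD is
    # the address of a global object inside libmpi, so it has to be looked
    # up at runtime (e.g. via dlsym) rather than hard-coded.
    const MPI_Comm = Ptr{Cvoid}
end
```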

I can't say this is what I did myself, but we got NAG involved in profiling our (CPU) MPI Julia code (for which we are very grateful!). They used Extrae to profile it, and provided some extremely clear results showing in great detail where the outstanding problems were. We had to get advice on Slack about how to instrument the libraries correctly, but it all worked out relatively(!) easily in the end (and maybe these recent fixes have solved those problems), except that we didn't seem to be able to go beyond 127 MPI processes for some as-yet-unresolved reason…

Thanks for the helpful replies everyone! We'll try @simonbyrne's suggestion of using NVIDIA Nsight, since it might let us kill two birds with one stone (GPU profiling and MPI profiling), and we'll post back!