How do people usually profile Julia MPI code to find bottlenecks and reduce communication overhead?
As an MPI noob I thought there would be some profiler tools out there (kinda like cuprof and NVIDIA Nsight), but they seem to be mostly proprietary (e.g. TotalView, Arm MAP) and probably overkill for our use case, since we’re not running with thousands of ranks yet.
I would love to see graphs like Figures 4 and 5 in this paper by Parry Husbands and Kathy Yelick: https://upc.lbl.gov/publications/husbands-lu-sc07.pdf. I imagine one could instrument the MPI calls, add timers, and log it all (a rough sketch of that idea is below), and then produce various nice visualizations like we do with our profilers. I think they wrote some custom tooling for the charts in that paper.
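Here is a minimal sketch of that idea, assuming MPI.jl; the event labels, the Allreduce/Barrier calls, and the CSV file name are just placeholders. Each rank records timestamped events around its communication calls and writes a per-rank log that could later be combined into Gantt-style charts:

```julia
# Hypothetical per-rank event logging; not a real tracing library.
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

events = Tuple{String,Float64,Float64}[]   # (label, t_start, t_stop)

# Record start/stop timestamps (MPI.Wtime is the MPI wall clock) around a call.
function timed(f, label)
    t0 = MPI.Wtime()
    result = f()
    push!(events, (label, t0, MPI.Wtime()))
    return result
end

buf = fill(Float64(rank), 1024)
timed(() -> MPI.Allreduce!(buf, +, comm), "Allreduce")
timed(() -> MPI.Barrier(comm), "Barrier")

# One log file per rank; post-process them together for visualization.
open("mpi_trace_rank$(rank).csv", "w") do io
    for (label, t0, t1) in events
        println(io, "$label,$t0,$t1")
    end
end
```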
We could use tricks similar to libblastrampoline to make this all nice and easy for MPI, but @staticfloat says ABI issues in MPI prevent this.
If you have relatively simple communication patterns and a small number of ranks, you can get decent results just by profiling a single rank. You can slap a TimerOutput on your MPI calls to get a nice summary of whether communication is the limiting factor (see the sketch below). If you have load-balancing issues, you can also time the operations on each rank and compute e.g. standard deviations across ranks. Of course that doesn’t replace a proper parallel profiler for large-scale applications, but for smaller setups it’s a good quick-and-dirty solution.
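A minimal sketch of that approach, assuming TimerOutputs.jl and MPI.jl; the compute step and the Allreduce are just stand-ins for real work and communication:

```julia
using MPI, TimerOutputs

MPI.Init()
comm = MPI.COMM_WORLD
const to = TimerOutput()

x = rand(10^6)
for iter in 1:100
    @timeit to "compute"   x .= sqrt.(abs.(x))         # local work
    @timeit to "Allreduce" MPI.Allreduce!(x, +, comm)  # communication
end

# Print the breakdown on rank 0 only, so the output stays readable.
MPI.Comm_rank(comm) == 0 && print_timer(to)
```

To look at load imbalance, you could additionally gather the per-rank timings to rank 0 and compute the spread there.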
Thanks to @lyonsquark, v0.17 of MPI.jl (which I just tagged) should now support MPI profilers which use LD_PRELOAD hooks. I believe he has tested it with Darshan.
I haven’t tried it, but I believe you should also be able to use NVIDIA Nsight Systems to profile MPI even if you’re not using CUDA: just pass the --trace=mpi option (you will also need to specify the MPI implementation via the --mpi-impl option).
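I haven’t verified the exact invocation, but it should look roughly like this (flag spellings vary a bit between nsys versions, and the %q{...} output pattern assumes Open MPI’s rank environment variable):

```
# one report per rank; use --mpi-impl=mpich for MPICH-based implementations
mpiexec -n 4 nsys profile --trace=mpi --mpi-impl=openmpi \
    --output=report_rank%q{OMPI_COMM_WORLD_RANK} julia --project script.jl
```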
I’d be keen to hear how people get on with various MPI profilers: if you do have problems (or find solutions to problems), please chime in here or open an issue.
We could use tricks similar to libblastrampoline to make this all nice and easy for MPI, but @staticfloat says ABI issues in MPI prevent this.
I haven’t looked into it too deeply, but I think an MPI demuxing library is definitely doable. It wouldn’t be quite the same as libblastrampoline, because the ABI actually differs between vendors (as opposed to BLAS, where vendors tend to differ only in naming and/or ILP64-ness), so you’d need to do a bit of argument translation. I haven’t looked into how you would autodetect the ABI, but I’m sure it’s possible. It’s not high on my TODO list, but I am confident we could do something similar if we really wanted to.
I think it would make sense to limit support to the Open MPI, MPICH and Microsoft MPI ABIs, which would cover almost all current MPI implementations (other than >5-year-old MPICH derivatives). The main challenge is how to define handles: they can be either 32-bit integers or pointers.
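To illustrate the handle problem, here is a rough sketch (not how MPI.jl or an actual trampoline is implemented, and the library path is an assumption): under the MPICH/Microsoft MPI ABI a communicator is a 32-bit integer constant, whereas under the Open MPI ABI it is a pointer to a global object whose address is only known once libmpi is loaded, so a demuxing layer has to pick different argument types per vendor:

```julia
using Libdl

# MPICH / Microsoft MPI ABI: handles are plain 32-bit integers,
# e.g. MPI_COMM_WORLD is the fixed constant 0x44000000.
const MPICH_COMM_WORLD = Cint(0x44000000)

# Query the rank of MPI_COMM_WORLD under either ABI (assumes MPI_Init has
# already been called through the same library).
function comm_rank(libmpi_path::AbstractString; mpich_abi::Bool)
    lib  = dlopen(libmpi_path)
    fptr = dlsym(lib, :MPI_Comm_rank)
    rank = Ref{Cint}(-1)
    if mpich_abi
        # integer handle, passed by value
        ccall(fptr, Cint, (Cint, Ptr{Cint}), MPICH_COMM_WORLD, rank)
    else
        # Open MPI: MPI_Comm is the address of the global ompi_mpi_comm_world
        comm = dlsym(lib, :ompi_mpi_comm_world)
        ccall(fptr, Cint, (Ptr{Cvoid}, Ptr{Cint}), comm, rank)
    end
    return rank[]
end
```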
I can’t say this is what I did myself, but we got NAG involved in profiling our (CPU) MPI Julia code (for which we are very grateful!), and they used Extrae to profile it, providing some extremely clear results showing in great detail where the outstanding problems were. We had to get advice on Slack about how to instrument the libraries correctly, but it all worked out relatively(!) easily in the end (and maybe these recent fixes have solved those problems), except that we didn’t seem to be able to go beyond 127 MPI processes for some as-yet-unresolved reason…
Thanks for the helpful replies, everyone! We’ll try @simonbyrne’s suggestion of using NVIDIA Nsight Systems, since it might let us kill two birds with one stone (GPU profiling and MPI profiling), and we’ll post back!
@PolarizedPoutine Have you had any success with nsys? I’m currently playing around with using it for regular non-MPI CPU profiling, but so far with limited success.
(Unfortunately?) it is based on Cassette.jl and might come with some overhead, though. It would be great to have an efficient, non-Cassette-based tracing API in Julia. But I think some people are working on something in this direction, if I’m not mistaken (see the profiling channel on Slack).
I’d be curious to hear whether there have been any further developments in MPI tracing. Anyone? In particular, there is MPE. Is there any Julia code to process MPE log files?