How do people usually profile Julia MPI code to find bottlenecks and reduce communication overhead?
As an MPI noob I assumed there would be profiler tools out there (kinda like nvprof and NVIDIA Nsight on the CUDA side), but the ones I've found seem to be mostly proprietary (e.g. TotalView, Arm MAP) and probably overkill for our use case, since we're not running with thousands of ranks yet.
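For context, the most sophisticated thing I've come up with so far is crude manual timing with `MPI.Wtime` around individual communication calls, roughly like this (just a sketch; the `Allreduce` and array size are placeholders for whatever call you're measuring):

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

data = fill(Float64(rank), 1_000_000)

MPI.Barrier(comm)      # sync first so stragglers don't skew the timing
t0 = MPI.Wtime()
result = MPI.Allreduce(data, +, comm)
t1 = MPI.Wtime()

println("rank $rank: Allreduce took $(t1 - t0) s")
MPI.Finalize()
```

This obviously doesn't scale to a whole codebase, which is why I'm asking what people actually use.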
EDIT: Right now I'm just trying to profile CPU MPI code, but I'm definitely looking to profile GPU/CUDA-aware MPI code in the near future.