I am part of a team developing an MPI physics code, mabarnes/moment_kinetics (github.com). We have integrated shared-memory MPI into the code using the MPI.jl package.
I have been looking for documentation on how to track errors when using MPI. Specifically, I have noticed that when the code fails because of a relatively ordinary domain error (e.g. sqrt(-1) or log(-1) is called), all cores hang rather than reporting the error. If I run on a single core, the error message is printed as normal. I am aware of the try/catch/finally syntax in Julia, and I know how to use MPI in a Fortran code, but what I am unclear on is how to combine the two. It seems that when an error is raised on one core, it is not communicated to the other cores, and the program hangs.
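To make the question concrete, here is a minimal sketch of the kind of pattern I have in mind. It is not taken from our code. It assumes that calling `MPI.Abort` from the failing rank is an acceptable way to bring down the other ranks, and `run_guarded` and `main` are hypothetical names introduced only for this illustration.

```julia
using MPI

# Hypothetical helper (not part of MPI.jl): run `f`; if it throws on this
# rank, print the error and then call `on_error`. By default `on_error`
# aborts the whole MPI job so the other ranks do not wait forever.
function run_guarded(f; on_error = () -> MPI.Abort(MPI.COMM_WORLD, 1))
    try
        return f()
    catch err
        # Report the error and backtrace on the failing rank before aborting.
        showerror(stderr, err, catch_backtrace())
        println(stderr)
        flush(stderr)
        on_error()
        return nothing
    end
end

function main()
    MPI.Init()
    comm = MPI.COMM_WORLD
    rank = MPI.Comm_rank(comm)
    run_guarded() do
        # Stand-in for the real work: only rank 0 hits a domain error.
        x = rank == 0 ? -1.0 : 1.0
        y = sqrt(x)        # throws DomainError on rank 0
        MPI.Barrier(comm)  # the other ranks block here while rank 0 fails
        y
    end
end

# Only run the MPI example when this file is executed as a script.
if abspath(PROGRAM_FILE) == @__FILE__
    main()
end
```

The intent is that the rank that fails prints its own error and then tears down the job, instead of leaving the remaining ranks blocked in a collective call. Whether aborting is preferable to propagating the error to the other ranks in some more graceful way is exactly what I am unsure about.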
Can anyone please suggest some documentation for me to read? Unfortunately, I do not have an MWE, although I do have an issue open on our own project (2D code appears to hang without error when running on HPC · Issue #86 · mabarnes/moment_kinetics (github.com)).