Error handling in Julia with MPI -- how to catch errors?

mrhardman · October 26, 2022, 8:58am

I am part of a team developing an MPI physics code (mabarnes/moment_kinetics (github.com)). We have integrated shared-memory MPI into the code using the MPI.jl package.

I have been looking for documentation on how to track errors when using MPI. Specifically, I have noticed that when the code fails because of a relatively ordinary domain error (e.g. sqrt(-1) or log(-1) is called), all cores hang rather than reporting the error. If I run on just a single core, the error messages are printed as normal. I am aware of the “try/catch/finally” syntax in Julia, and of how to use MPI in a Fortran code, but what I am unclear on is how to weld these together. It seems that when one core hits an error, this is not communicated to the other cores, which keep waiting in MPI calls, so the program hangs.
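To make the question concrete, here is a minimal sketch of the kind of pattern I have in mind: wrap the work in try/catch and call `MPI.Abort` from the catch block so that one failing rank brings down all the others instead of leaving them waiting. This is not code from moment_kinetics; `describe_error`, `risky_step` and `main` are hypothetical names made up for illustration, and I am assuming that aborting the whole job is acceptable rather than trying to recover.

```julia
using MPI

# Hypothetical helper: format an error (optionally with its backtrace),
# tagged with the rank that raised it.
function describe_error(rank::Integer, err, bt=nothing)
    io = IOBuffer()
    print(io, "rank ", rank, ": ")
    bt === nothing ? showerror(io, err) : showerror(io, err, bt)
    return String(take!(io))
end

# Hypothetical stand-in for the real computation: rank 0 hits a DomainError,
# every other rank succeeds.
risky_step(rank) = sqrt(rank == 0 ? -1.0 : 1.0)

function main()
    MPI.Init()
    comm = MPI.COMM_WORLD
    rank = MPI.Comm_rank(comm)
    try
        risky_step(rank)
        # Ranks that did not fail wait here for the rank that did.
        MPI.Barrier(comm)
    catch err
        # Report the error from the failing rank, then stop every rank in comm.
        println(stderr, describe_error(rank, err, catch_backtrace()))
        flush(stderr)
        MPI.Abort(comm, 1)
    end
    MPI.Finalize()
end

# Only run when executed as a script, not when included from elsewhere.
if abspath(PROGRAM_FILE) == @__FILE__
    main()
end
```

Launched on two or more ranks (for example with something like `mpiexec -n 2 julia script.jl`, where the script name is a placeholder), I would expect rank 0 to print its DomainError and the whole job to exit rather than hang at the barrier. Whether this is the recommended approach, or whether there is a cleaner way to propagate the error between ranks, is what I would like to know.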

Can anyone please suggest some documentation for me to read? Unfortunately, I do not have an MWE, although I do have an issue open on our own project (2D code appears to hang without error when running on HPC · Issue #86 · mabarnes/moment_kinetics (github.com)).
