OpenMPI+Threading: Segfault in compilation?

I have a package that uses multithreading (with Threads.@threads), and I am trying to parallelize it further using MPI.jl. It segfaults when I run it on our HPC cluster (using the installation of OpenMPI 4.1.5 provided by the host).

The code works when

  • using mpiexecjl, with --threads 2
  • using mpiexec, with any -n, and no threads
  • not using MPI at all, with any number of --threads

but segfaults when I use the system-provided OpenMPI installation and threading:

module load mpi/openmpi/4.1
module load julia/1.9.3

export ZES_ENABLE_SYSMAN=1 # as recommended by the docs, just in case
export OMPI_MCA_coll_hcoll_enable="0" # as recommended by the docs, just in case
mpiexec -n 1 julia --project=.... --threads 2 debug-mpi.jl

The error message is

[1699306339.988315] [hkn1990:166490:0]          parser.c:1911 UCX  WARN  unused env variables: UCX_MEMTYPE_CACHE,UCX_ERROR_SIGNALS (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[hkn1990:166490:1:166493] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x14a9242e5008)
==== backtrace (tid: 166493) ====
 0 0x0000000000012ce0 __funlockfile()  :0
 1 0x000000000008f2e1 _jl_mutex_wait()  /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/threading.c:717
 2 0x000000000008f3a0 _jl_mutex_lock()  /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/threading.c:755
 3 0x00000000000bc393 jl_mutex_lock()  /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia_locks.h:66
 4 0x0000000000042310 jl_compile_method_internal()  /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2348
 5 0x0000000000042310 jl_compile_method_internal()  /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2237
 6 0x000000000004311e _jl_invoke()  /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2750
 7 0x000000000004311e ijl_apply_generic()  /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
 8 0x00000000000645c0 jl_apply()  /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880

[166490] signal (11.-6): Segmentation fault
in expression starting at /hkfs/home/project/hk-project-scs/hs2454/PMFRG/debug-MPI.jl:24
_jl_mutex_wait at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/threading.c:717
_jl_mutex_lock at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/threading.c:755
jl_mutex_lock at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia_locks.h:66 [inlined]
jl_generate_fptr_impl at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/jitlayers.cpp:424
jl_compile_method_internal at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2348 [inlined]
jl_compile_method_internal at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2237
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2750 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
start_task at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/task.c:1092
Allocations: 3284432 (Pool: 3282767; Big: 1665); GC: 5
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpiexec noticed that process rank 0 with PID 0 on node hkn1990 exited on signal 11 (Segmentation fault).

The script debug-mpi.jl does not contain any MPI calls except the usual MPI.Init() and MPI.Finalize(), and if I comment these out it works perfectly with multiple threads and no MPI.

By printf-debugging, it looks like it crashes during a Threads.@threads for loop.
I have tried to trim down the code to get a small, neat reproduction case, but I found that the program does not behave deterministically: most of the time it crashes, sometimes it does not. I have the impression that removing code from the execution path (i.e., commenting out code in the functions down the call tree) decreases the probability of a crash, but I am not sure about this.

Here are my questions:

  • Do you have suggestions on how to debug this?
  • To your trained eyes, does it look like something in the code I am working on or something that is out of my control (e.g., Julia internals)?

See Known issues · MPI.jl

Julia uses the SIGSEGV signal internally, and you need to tell MPI to forward the signal instead of aborting your process.

We handle this automatically for OpenMPI+UCX, but it seems that this doesn’t apply to your installation, judging by the "UCX WARN unused env variables" message in your log.

Can you run module info OpenMPI/4.1 and check which UCX version and build flags are involved?


Ah, I completely forgot/missed that part of the docs. Our installation does not seem to set UCX_ERROR_SIGNALS at all.

ompi_info gives:

Of course I can still use mpiexecjl without issues, but I wonder if there are advantages to using the system installation - I am assuming it might have been tuned better for our hardware, and at the moment I cannot test this.

So what confuses me is the UCX warning in your error log that lists UCX_MEMTYPE_CACHE and UCX_ERROR_SIGNALS as unused env variables.

So UCX sees UCX_ERROR_SIGNALS but is ignoring it?
Can you try and see which UCX version this may be?

Ah, I even ran into this myself: UCX ignores UCX_ERROR_SIGNALS set by MPI.jl · Issue #409 · JuliaParallel/MPI.jl · GitHub

Do try running with: export UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE"
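
As a minimal sketch of that suggestion, the variable can be exported and sanity-checked in the job script before launching. The signal list is the one given above; the check itself and the commented-out launch line (copied from the original post, with its project path elided there as well) are only illustrative.

```shell
# Signal list suggested in this thread: SIGSEGV is deliberately left out,
# so that UCX does not treat Julia's internal use of it as a fatal error.
export UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE"

# Illustrative sanity check: warn if SIGSEGV is still in the list.
case ",$UCX_ERROR_SIGNALS," in
  *,SIGSEGV,*) echo "warning: SIGSEGV is still listed" >&2 ;;
  *)           echo "UCX_ERROR_SIGNALS=$UCX_ERROR_SIGNALS" ;;
esac

# Launch as in the original post:
# mpiexec -n 1 julia --project=.... --threads 2 debug-mpi.jl
```

The export has to happen in the same shell (or batch script) that calls mpiexec, so that the launched Julia process inherits it.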


So, the ucx version (by ucx_info) is 1.13.1.
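
For anyone wanting to repeat the check, a sketch along these lines should work, assuming ucx_info is on the PATH after loading the MPI module and that this build supports the -v (version) and -c (configuration) options. The function name report_ucx is made up for this example.

```shell
# Hypothetical helper: print the UCX version and the error-signal setting
# that UCX reports for the current environment.
report_ucx() {
  if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -v                                   # version and build configuration
    ucx_info -c | grep ERROR_SIGNALS || true      # error-signal entry, if printed
  else
    echo "ucx_info not found on PATH"
  fi
}

report_ucx
```

Running it once before and once after exporting UCX_ERROR_SIGNALS is a quick way to see whether the setting is picked up.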

Oh, I see. So even when UCX_ERROR_SIGNALS is not set when mpiexec runs, UCX will still treat SIGSEGV as an error signal.

This was mentioned here, true, but it also says

“If set externally, it should be modified to exclude SIGSEGV from the list”

which in this case is not true. Would an issue or a pull request to update the docs be welcome, if I were to open one?
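
For the "set externally" case that the docs describe, a small shell helper could strip SIGSEGV from whatever list is already in the environment. This is only a sketch for bash or POSIX sh; the function name strip_sigsegv is hypothetical and not part of MPI.jl or UCX.

```shell
# Hypothetical helper: remove SIGSEGV from a comma-separated signal list.
strip_sigsegv() {
  old_ifs=$IFS
  IFS=','
  out=''
  for sig in $1; do                      # split the argument on commas
    [ "$sig" = "SIGSEGV" ] && continue   # skip the signal Julia needs
    out="${out:+$out,}$sig"              # re-join the remaining entries
  done
  IFS=$old_ifs
  printf '%s\n' "$out"
}

# Only rewrite the variable if the site or module has already set it.
if [ -n "${UCX_ERROR_SIGNALS:-}" ]; then
  export UCX_ERROR_SIGNALS="$(strip_sigsegv "$UCX_ERROR_SIGNALS")"
fi
```

When the variable is not set at all, as on this cluster, the helper does nothing, and the list has to be exported explicitly as described earlier in the thread.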

By the way, and for the record: now that everything works even when using mpirun instead of mpiexecjl, I have managed to run both versions, and I see absolutely no difference in run times.
