OpenMPI+Threading: Segfault in compilation?

I have a package that uses multithreading (via Threads.@threads), and I am trying to parallelize it further with MPI.jl. It segfaults when I run it on our HPC cluster using the host-provided installation of OpenMPI 4.1.5.

The code works when

  • using mpiexecjl, with --threads 2
  • using mpiexec, with any -n, and no threads
  • not using MPI at all, with any number of --threads

but it segfaults when I use the system-provided OpenMPI installation together with threading:

module load mpi/openmpi/4.1
module load julia/1.9.3

export ZES_ENABLE_SYSMAN=1 # as recommended by the docs, just in case
export OMPI_MCA_coll_hcoll_enable="0" # as recommended by the docs, just in case
mpiexec -n 1 julia --project=.... --threads 2 debug-mpi.jl

The error message is

[1699306339.988315] [hkn1990:166490:0]          parser.c:1911 UCX  WARN  unused env variables: UCX_MEMTYPE_CACHE,UCX_ERROR_SIGNALS (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[hkn1990:166490:1:166493] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x14a9242e5008)
==== backtrace (tid: 166493) ====
 0 0x0000000000012ce0 __funlockfile()  :0
 1 0x000000000008f2e1 _jl_mutex_wait()  /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/threading.c:717
 2 0x000000000008f3a0 _jl_mutex_lock()  /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/threading.c:755
 3 0x00000000000bc393 jl_mutex_lock()  /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia_locks.h:66
 4 0x0000000000042310 jl_compile_method_internal()  /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2348
 5 0x0000000000042310 jl_compile_method_internal()  /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2237
 6 0x000000000004311e _jl_invoke()  /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2750
 7 0x000000000004311e ijl_apply_generic()  /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
 8 0x00000000000645c0 jl_apply()  /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880
=================================

[166490] signal (11.-6): Segmentation fault
in expression starting at /hkfs/home/project/hk-project-scs/hs2454/PMFRG/debug-MPI.jl:24
_jl_mutex_wait at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/threading.c:717
_jl_mutex_lock at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/threading.c:755
jl_mutex_lock at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia_locks.h:66 [inlined]
jl_generate_fptr_impl at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/jitlayers.cpp:424
jl_compile_method_internal at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2348 [inlined]
jl_compile_method_internal at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2237
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2750 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
start_task at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/task.c:1092
Allocations: 3284432 (Pool: 3282767; Big: 1665); GC: 5
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node hkn1990 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The script debug-mpi.jl does not contain any MPI calls except the usual MPI.Init() and MPI.Finalize(); if I comment those out, it works perfectly with multiple threads and no MPI.

From printf-debugging, it looks like the crash happens inside a Threads.@threads for loop.
I have tried to trim the code down to a small, neat reproduction case, but the program does not behave deterministically: most of the time it crashes, sometimes it does not. I have the impression that removing code from the crashing execution path (i.e., commenting out code in the functions further down the call tree) decreases the probability of a crash, but I am not sure about this.
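For context, the script has roughly the following shape. This is only a sketch: the function name, array, and loop body are placeholders I made up, not the actual code.

using MPI

function do_work!(results)   # placeholder for the real computation
    Threads.@threads for i in eachindex(results)
        results[i] = sum(sin(i * x) for x in 1:1_000)   # dummy compute-heavy body
    end
    return results
end

MPI.Init()

results = zeros(100)
do_work!(results)   # the crash appears to happen inside the @threads loop
println("rank $(MPI.Comm_rank(MPI.COMM_WORLD)) done")

MPI.Finalize()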

Here are my questions:

  • Do you have suggestions on how to debug this?
  • To your trained eyes, does it look like something in the code I am working on or something that is out of my control (e.g., Julia internals)?

See Known issues · MPI.jl

Julia uses that signal (SIGSEGV) internally, and you need to tell MPI/UCX to forward it to Julia instead of aborting your process.

We handle this automatically for OpenMPI+UCX, but it seems that it is not taking effect for your installation, judging from the "UCX WARN unused env variables" warning in your output.

Can you do a module info OpenMPI/4.1 and check what UCX version and build flags are involved?


Ah, I completely forgot/missed that part of the docs. Our installation does not seem to set UCX_ERROR_SIGNALS at all.
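
To double-check what a launched Julia process actually sees, I used a quick throwaway script along these lines (check-ucx-env.jl is a name I made up; nothing here is MPI.jl-specific):

# Run as: mpiexec -n 1 julia --project=... check-ucx-env.jl
# What did the launcher/module environment hand us, before MPI.jl touches anything?
println("UCX_ERROR_SIGNALS (inherited) = ", get(ENV, "UCX_ERROR_SIGNALS", "<not set>"))
println(readchomp(`ucx_info -v`))   # UCX version and build flags

using MPI
MPI.Init()
# ...and after loading/initializing MPI.jl, which may set the variable itself.
println("UCX_ERROR_SIGNALS (after MPI.Init) = ", get(ENV, "UCX_ERROR_SIGNALS", "<not set>"))
MPI.Finalize()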

ompi_info gives:

                 Package: Open MPI XXXXXXXX@hktn1999.localdomain Distribution
                Open MPI: 4.1.5
  Open MPI repo revision: v4.1.5
   Open MPI release date: Feb 23, 2023
                Open RTE: 4.1.5
  Open RTE repo revision: v4.1.5
   Open RTE release date: Feb 23, 2023
                    OPAL: 4.1.5
      OPAL repo revision: v4.1.5
       OPAL release date: Feb 23, 2023
                 MPI API: 3.1.0
            Ident string: 4.1.5
                  Prefix: /software/all/mpi/openmpi/4.1.5_intel_19.1
 Configured architecture: x86_64-redhat-linux-gnu
          Configure host: hktn1999.localdomain
           Configured by: XXXXX
           Configured on: Sat Apr 15 18:48:06 UTC 2023
          Configure host: hktn1999.localdomain
  Configure command line: '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--prefix=/software/all/mpi/openmpi/4.1.5_intel_19.1' '--bindir=/software/all/mpi/openmpi/4.1.5_intel_19.1/bin' '--datadir=/software/all/mpi/openmpi/4.1.5_intel_19.1/share' '--includedir=/software/all/mpi/openmpi/4.1.5_intel_19.1/include' '--libdir=/software/all/mpi/openmpi/4.1.5_intel_19.1/lib64' '--mandir=/software/all/mpi/openmpi/4.1.5_intel_19.1/share/man' '--infodir=/software/all/mpi/openmpi/4.1.5_intel_19.1/share/info' '--disable-dependency-tracking' '--enable-shared' '--enable-static' '--disable-heterogeneous' '--enable-mpi-thread-multiple' '--without-verbs' '--without-mxm' '--without-psm' '--without-psm2' '--without-ofi' '--with-cuda=/software/all/devel/cuda/11.8' '--with-ucx' '--with-hwloc=/opt/hwloc/2.7' '--with-hwloc-libdir=/opt/hwloc/2.7/lib' '--enable-mpi1-compatibility' '--without-tm' '--with-slurm' '--with-pmi' '--with-pmix=internal' '--enable-mpi-fortran=all' 'CC=/opt/intel/compilers_and_libraries_2020/linux/bin/intel64/icc' 'CFLAGS=-O2 -m64 -mtune=generic' 'CXX=/opt/intel/compilers_and_libraries_2020/linux/bin/intel64/icpc' 'CXXFLAGS=-O2 -m64 -mtune=generic' 'FC=/opt/intel/compilers_and_libraries_2020/linux/bin/intel64/ifort' 'FFLAGS=-O2 -m64 -mtune=generic' 'F77=/opt/intel/compilers_and_libraries_2020/linux/bin/intel64/ifort'
                Built by: XXXXX
                Built on: Sat Apr 15 18:55:13 UTC 2023
              Built host: hktn1999.localdomain
              C bindings: yes
            C++ bindings: no
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the /opt/intel/compilers_and_libraries_2020/linux/bin/intel64/ifort compiler and/or Open MPI, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: /opt/intel/compilers_and_libraries_2020/linux/bin/intel64/icc
     C compiler absolute: 
  C compiler family name: INTEL
      C compiler version: 1910.20200925
            C++ compiler: /opt/intel/compilers_and_libraries_2020/linux/bin/intel64/icpc
   C++ compiler absolute: none
           Fort compiler: /opt/intel/compilers_and_libraries_2020/linux/bin/intel64/ifort
       Fort compiler abs: 
         Fort ignore TKR: yes (!DEC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
          Fort PROTECTED: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
           C++ profiling: no
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
 mpirun default --prefix: no
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
      MPI1 compatibility: yes
          MPI extensions: affinity, cuda, pcollreq
   FT Checkpoint support: no (checkpoint thread: no)
   C/R Enabled Debugging: no
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.1.5)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.1.5)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.5)
            MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.1.5)
            MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA event: external (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA hwloc: external (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.1.5)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.1.5)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.1.5)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component v4.1.5)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA pmix: s1 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA pmix: s2 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.1.5)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.1.5)
              MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v4.1.5)
              MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v4.1.5)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.1.5)
              MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component v4.1.5)
              MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component v4.1.5)
              MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component v4.1.5)
              MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.1.5)
               MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.1.5)
             MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.1.5)
                MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.1.5)
                MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.1.5)
              MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.1.5)
              MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.1.5)
              MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.1.5)
              MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.1.5)
              MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.1.5)
              MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.1.5)
              MCA schizo: jsm (MCA v2.1.0, API v1.0.0, Component v4.1.5)
              MCA schizo: singularity (MCA v2.1.0, API v1.0.0, Component v4.1.5)
              MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.1.5)
                 MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: adapt (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: han (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA fs: gpfs (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA fs: lustre (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA op: avx (MCA v2.1.0, API v1.0.0, Component v4.1.5)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.1.5)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component v4.1.5)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component v4.1.5)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.1.5)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component v4.1.5)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component v4.1.5)

Of course I can still use mpiexecjl without issues, but I wonder whether there are advantages to the system installation; I am assuming it might be better tuned for our hardware, and at the moment I cannot test this.

So what confuses me is this message in your error log:

UCX  WARN  unused env variables: UCX_MEMTYPE_CACHE,UCX_ERROR_SIGNALS

So UCX sees UCX_ERROR_SIGNALS but is ignoring it?
Can you try to find out which UCX version this is?

Ah, I even encountered this myself: UCX ignores UCX_ERROR_SIGNALS set by MPI.jl · Issue #409 · JuliaParallel/MPI.jl · GitHub

Do try running with: export UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE"


So, the UCX version (reported by ucx_info) is 1.13.1.

Oh, I see. So even when UCX_ERROR_SIGNALS is not set when mpiexec runs, UCX still treats SIGSEGV as an error signal, because SIGSEGV is part of its built-in default list.
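
As a sanity check, the defaults that a given UCX build would use can be dumped with ucx_info. Something like this (wrapped in Julia just for convenience, and assuming ucx_info -c lists UCX_ERROR_SIGNALS on this build) shows the relevant line:

# Print the UCX configuration entries that mention the error-signal list.
for line in eachline(`ucx_info -c`)
    occursin("ERROR_SIGNALS", line) && println(line)
end
# On a stock build I would expect something like:
#   UCX_ERROR_SIGNALS=SIGILL,SIGSEGV,SIGBUS,SIGFPE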

This was mentioned here, true, but it also says

“If set externally, it should be modified to exclude SIGSEGV from the list”

which does not quite cover this case: the variable was not set externally at all, and I still had to export it myself. Would an issue or a pull request to update the docs be welcome, if I were to make one?

By the way, and for the record: now that everything works even when using mpirun instead of mpiexecjl, I have run both versions and I see absolutely no difference in run times.

Always!