Julia crashes inside @threads with MPI

Hi, I am trying to implement a program with hybrid MPI + Threads parallelization.

In the following example, if I comment out all MPI-related lines, the code runs without any problem. However, with the MPI lines included, Julia crashes nondeterministically, in roughly 50% of runs.

# mwe.jl
using MPI
using Base.Threads

MPI.Init_thread(MPI.THREAD_FUNNELED)
world_comm = MPI.COMM_WORLD

struct MyStruct
    v::Vector{Vector{Float64}}
end

mystruct = MyStruct([[1.0] for i=1:nthreads()])

println("With @threads")
Threads.@threads :static for i in 1:nthreads()
    println(mystruct.v[threadid()])
end

MPI.Finalize()
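
(An aside on the initialization above: since no MPI calls are made inside the @threads loop, THREAD_FUNNELED should be a sufficient thread level here. If I read the MPI.jl API correctly, MPI.Init_thread returns the provided thread level, so a quick sanity check is possible; this is only a sketch, not part of the crash:)

# Sketch: check which thread level the MPI library actually provides.
# Assumes MPI.Init_thread returns the provided MPI.ThreadLevel.
provided = MPI.Init_thread(MPI.THREAD_FUNNELED)
println("MPI provided thread level: ", provided)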

Run julia -t 2 mwe.jl, output:

With @threads

signal (11): Segmentation fault
in expression starting at /home/jmlim/julia_epw/EPW.jl/running/mwe.jl:16
jl_mutex_wait at /buildworker/worker/package_linux64/build/src/locks.h:37 [inlined]
jl_mutex_lock at /buildworker/worker/package_linux64/build/src/locks.h:94
jl_generate_fptr at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:272
jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1964
jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1919 [inlined]
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2224 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
println at ./coreio.jl:4
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2231 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
macro expansion at /home/jmlim/julia_epw/EPW.jl/running/mwe.jl:17 [inlined]
#3#threadsfor_fun at ./threadingconstructs.jl:81
#3#threadsfor_fun at ./threadingconstructs.jl:48
unknown function (ip: 0x2ab50928062c)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2231 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:705
unknown function (ip: (nil))
Allocations: 780384 (Pool: 780097; Big: 287); GC: 1
Segmentation fault (core dumped)

With julia -t 1 mwe.jl, the output is fine:

With @threads
[1.0]

Another observation: if I use v = [[1.0] for i=1:nthreads()] as an independent top-level variable, instead of accessing it as the struct field mystruct.v, Julia does not crash.
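
For concreteness, the non-crashing variant looks roughly like this (a sketch of what I mean, not the full program):

# Variant that does not crash: the per-thread buffers live in a plain
# top-level variable instead of a struct field.
v = [[1.0] for i in 1:nthreads()]

Threads.@threads :static for i in 1:nthreads()
    println(v[threadid()])
end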

What is the reason for this crash?

I am using Intel MPI Version 2019 on Linux (CentOS 7) with the downloaded Julia 1.5.3 binary. I tested two MPI.jl versions, v0.14.3 and v0.16.1 (the most recent), and both crash.

julia> MPI.identify_implementation()
(MPI.IntelMPI, v"2019.0.0")

(In the real program, the v::Vector{Vector{Float64}} field is used as pre-allocated buffers, one for each thread; see the sketch below.)
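
To illustrate the intended pattern, the real use looks roughly like this (simplified; the loop body is just a stand-in for the actual computation, and nwork is a made-up value):

# Sketch of the intended use: each thread reuses its own pre-allocated
# buffer, so no locking is needed inside the loop.
nwork = 10
mystruct = MyStruct([[0.0] for i in 1:nthreads()])

Threads.@threads :static for i in 1:nwork
    buf = mystruct.v[threadid()]  # per-thread buffer
    buf[1] += i                   # stand-in for the real in-place work
end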

It seems that the issue comes from the system MPI: when I use the automatically downloaded MPICH instead, the problem does not occur.
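
In case it is useful to others: for the MPI.jl versions I tested, I believe the switch between the system MPI and the bundled MPICH goes through the JULIA_MPI_BINARY environment variable plus a rebuild of MPI.jl, roughly like this (please check the MPI.jl configuration docs for your version; the exact accepted values are an assumption on my part):

# Sketch: select the bundled MPICH instead of the system MPI, then rebuild MPI.jl.
ENV["JULIA_MPI_BINARY"] = ""  # empty/unset should select the default bundled binary
using Pkg
Pkg.build("MPI"; verbose=true)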

If you are calling MPI.Init, you typically have to run Julia with mpirun, mpiexec, or srun (depending on your system and MPI variant). For example:

# Run with one MPI rank
mpiexec -n 1 julia -t 2 mwe.jl

Thank you for the reply, and for the information about the mpirun and mpiexec commands.
Unfortunately, the crash persists when using mpirun or mpiexec.

Since the crash does not happen with the Julia-provided MPICH, I suspect that the system MPI is the cause.

Hi @Jae-Mo_Lihm
Take a look at this section of the MPI.jl manual:
https://juliaparallel.github.io/MPI.jl/stable/knownissues/#Multi-threading-and-signal-handling

Hi @fverdugo
Thank you for the pointer! My case indeed seems very related to the issue https://github.com/JuliaParallel/MPI.jl/issues/337.

That issue seems to have been fixed for recent versions of OpenMPI, but I am using Intel MPI and the error still occurs. Setting export UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE" did not fix the crash.

$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2019 Update 9 Build 20200923 (id: abd58e492)
Copyright 2003-2020, Intel Corporation.