MPI.jl test_io_shared failing

Hi all, I’m testing my MPI.jl installation, which was built against the cluster’s MPI library (OpenMPI 3.1.6 compiled with GCC 8.3.0). The tests (]test MPI) succeed except for test_io_shared. The error message is below. I’m only running on a single node right now. Any thoughts?
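For context, the package was pointed at the system MPI roughly as sketched below. This assumes the older JULIA_MPI_BINARY build mechanism (rather than MPIPreferences), and the install prefix is a placeholder, not the real path:

# Rough sketch of how MPI.jl was built against the cluster MPI (assumes the
# pre-0.20 build mechanism; the prefix below is a placeholder).
ENV["JULIA_MPI_BINARY"] = "system"
ENV["JULIA_MPI_PATH"]   = "/path/to/openmpi-3.1.6-gcc-8.3.0"

using Pkg
Pkg.build("MPI"; verbose=true)
Pkg.test("MPI")      # same as ]test MPI in the REPL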

Test Failed at /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/test_io_shared.jl:41
  Expression: MPI.File.get_position_shared(fh) == 0
   Evaluated: 1 == 0
ERROR: LoadError: There was an error during testing
in expression starting at /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/test_io_shared.jl:41

(The same failure and LoadError are printed by each of the four ranks; their output is interleaved in the raw log.)
[1646673215.993979] [atl1-1-02-005-33:33246:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 2000081dd apid 4000081de is not released, refcount 1
[1646673215.994014] [atl1-1-02-005-33:33246:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 2000081dc apid 3000081de is not released, refcount 1
[1646673215.994023] [atl1-1-02-005-33:33246:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 2000081df apid 2000081de is not released, refcount 1
[1646673215.994028] [atl1-1-02-005-33:33246:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 2000081de apid 1000081de is not released, refcount 1
[1646673215.993981] [atl1-1-02-005-33:33247:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 2000081dd apid 4000081df is not released, refcount 1
[1646673215.994024] [atl1-1-02-005-33:33247:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 2000081dc apid 2000081df is not released, refcount 1
[1646673215.994031] [atl1-1-02-005-33:33247:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 2000081df apid 1000081df is not released, refcount 1
[1646673215.994035] [atl1-1-02-005-33:33247:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 2000081de apid 3000081df is not released, refcount 1
[1646673215.993975] [atl1-1-02-005-33:33244:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 2000081dd apid 2000081dc is not released, refcount 1
[1646673215.994012] [atl1-1-02-005-33:33244:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 2000081dc apid 1000081dc is not released, refcount 1
[1646673215.994023] [atl1-1-02-005-33:33244:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 2000081df apid 4000081dc is not released, refcount 1
[1646673215.994028] [atl1-1-02-005-33:33244:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 2000081de apid 3000081dc is not released, refcount 1
[1646673215.993981] [atl1-1-02-005-33:33245:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 2000081dd apid 1000081dd is not released, refcount 1
[1646673215.994018] [atl1-1-02-005-33:33245:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 2000081dc apid 4000081dd is not released, refcount 1
[1646673215.994024] [atl1-1-02-005-33:33245:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 2000081df apid 3000081dd is not released, refcount 1
[1646673215.994030] [atl1-1-02-005-33:33245:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 2000081de apid 2000081dd is not released, refcount 1
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[15804,1],2]
  Exit code:    1
--------------------------------------------------------------------------
test_io_shared.jl: Error During Test at /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/runtests.jl:26
  Got exception outside of a @test
  failed process: Process(`mpiexec -n 4 /storage/coda-apps/test/manual/packages/julia/1.7.2/gcc-8.3.0/bin/julia -Cnative -J/storage/coda-apps/test/manual/packages/julia/1.7.2/gcc-8.3.0/lib/julia/sys.so --depwarn=yes --check-bounds=yes -g1 --color=yes --startup-file=no /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/test_io_shared.jl`, ProcessExited(1)) [1]

  Stacktrace:
    [1] pipeline_error
      @ ./process.jl:531 [inlined]
    [2] run(::Cmd; wait::Bool)
      @ Base ./process.jl:446
    [3] run
      @ ./process.jl:444 [inlined]
    [4] (::var"#13#15"{String})(cmd::Cmd)
      @ Main /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/runtests.jl:38
    [5] (::MPI.var"#28#29"{var"#13#15"{String}})(cmd::Cmd)
      @ MPI /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/src/environment.jl:25
    [6] _mpiexec
      @ /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/deps/deps.jl:6 [inlined]
    [7] mpiexec(fn::var"#13#15"{String})
      @ MPI /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/src/environment.jl:25
    [8] macro expansion
      @ /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/runtests.jl:27 [inlined]
    [9] top-level scope
      @ /storage/coda-apps/test/manual/packages/julia/1.7.2/gcc-8.3.0/share/julia/stdlib/v1.7/Test/src/Test.jl:1359
   [10] include(fname::String)
      @ Base.MainInclude ./client.jl:451
   [11] top-level scope
      @ none:6
   [12] eval
      @ ./boot.jl:373 [inlined]
   [13] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:268
   [14] _start()
      @ Base ./client.jl:495
Test Summary:     | Error  Total
test_io_shared.jl |     1      1
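
In case it helps, the assertion that fails exercises MPI shared file pointers. A minimal standalone sketch of what the test is doing (hypothetical, modeled loosely on test_io_shared.jl; the file name and buffer contents are made up) would be roughly:

using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# All ranks must open the same file; a fixed name keeps the sketch simple
# (the actual test uses a temporary file).
filename = "shared_io_repro.bin"
fh = MPI.File.open(comm, filename; read=true, write=true, create=true)

# Each rank appends one block of data through the shared file pointer.
MPI.File.write_shared(fh, fill(Float64(rank), 4))
MPI.Barrier(comm)

# Rewind the shared pointer; every rank should then see position 0,
# which is the assertion that evaluates to 1 == 0 above.
MPI.File.seek_shared(fh, 0)
MPI.Barrier(comm)
@show MPI.File.get_position_shared(fh)

close(fh)
MPI.Barrier(comm)
rank == 0 && rm(filename; force=true)
MPI.Finalize()

This would be run the same way the test harness does it, e.g. mpiexec -n 4 julia shared_io_repro.jl (compare the failed-process command above).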

Ron Rahaman
Research Scientist II
Partnership for an Advanced Computing Environment (PACE)
Georgia Institute of Technology

Quick update: before, I was not running julia with --threads. Now I am, and I get some more interesting output. Is the fork/exec/system call part of the tests only, or is it part of the MPI.jl backend?

Test Failed at /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/test_io_shared.jl:41
  Expression: MPI.File.get_position_shared(fh) == 0
   Evaluated: 1 == 0
ERROR: LoadError: There was an error during testing
in expression starting at /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/test_io_shared.jl:41

(As before, the same failure and LoadError are printed by each of the four ranks; their output is interleaved in the raw log.)
[1646677101.310694] [atl1-1-02-005-33:54219:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 20000d3c9 apid 30000d3cb is not released, refcount 1
[1646677101.310711] [atl1-1-02-005-33:54220:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 20000d3c9 apid 20000d3cc is not released, refcount 1
[1646677101.310711] [atl1-1-02-005-33:54218:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 20000d3c9 apid 40000d3ca is not released, refcount 1
[1646677101.310743] [atl1-1-02-005-33:54218:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 20000d3cb apid 20000d3ca is not released, refcount 1
[1646677101.310750] [atl1-1-02-005-33:54218:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 20000d3ca apid 10000d3ca is not released, refcount 1
[1646677101.310755] [atl1-1-02-005-33:54218:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 20000d3cc apid 30000d3ca is not released, refcount 1
[1646677101.310728] [atl1-1-02-005-33:54219:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 20000d3cb apid 10000d3cb is not released, refcount 1
[1646677101.310735] [atl1-1-02-005-33:54219:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 20000d3ca apid 40000d3cb is not released, refcount 1
[1646677101.310741] [atl1-1-02-005-33:54219:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 20000d3cc apid 20000d3cb is not released, refcount 1
[1646677101.310750] [atl1-1-02-005-33:54220:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 20000d3cb apid 40000d3cc is not released, refcount 1
[1646677101.310760] [atl1-1-02-005-33:54220:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 20000d3ca apid 30000d3cc is not released, refcount 1
[1646677101.310766] [atl1-1-02-005-33:54220:0]        mm_xpmem.c:91   UCX  WARN  remote segment id 20000d3cc apid 10000d3cc is not released, refcount 1
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          [[28577,1],0] (PID 54217)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[28577,1],2]
  Exit code:    1
--------------------------------------------------------------------------
test_io_shared.jl: Error During Test at /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/runtests.jl:26
  Got exception outside of a @test
  failed process: Process(`mpiexec -n 4 /storage/coda-apps/test/manual/packages/julia/1.7.2/gcc-8.3.0/bin/julia -Cnative -J/storage/coda-apps/test/manual/packages/julia/1.7.2/gcc-8.3.0/lib/julia/sys.so --depwarn=yes --check-bounds=yes -g1 --color=yes --startup-file=no /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/test_io_shared.jl`, ProcessExited(1)) [1]

  Stacktrace:
    [1] pipeline_error
      @ ./process.jl:531 [inlined]
    [2] run(::Cmd; wait::Bool)
      @ Base ./process.jl:446
    [3] run
      @ ./process.jl:444 [inlined]
    [4] (::var"#13#15"{String})(cmd::Cmd)
      @ Main /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/runtests.jl:38
    [5] (::MPI.var"#28#29"{var"#13#15"{String}})(cmd::Cmd)
      @ MPI /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/src/environment.jl:25
    [6] _mpiexec
      @ /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/deps/deps.jl:6 [inlined]
    [7] mpiexec(fn::var"#13#15"{String})
      @ MPI /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/src/environment.jl:25
    [8] macro expansion
      @ /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/runtests.jl:27 [inlined]
    [9] top-level scope
      @ /storage/coda-apps/test/manual/packages/julia/1.7.2/gcc-8.3.0/share/julia/stdlib/v1.7/Test/src/Test.jl:1359
   [10] include(fname::String)
      @ Base.MainInclude ./client.jl:451
   [11] top-level scope
      @ none:6
   [12] eval
      @ ./boot.jl:373 [inlined]
   [13] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:268
   [14] _start()
      @ Base ./client.jl:495
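
As an aside, the warning text above points at the mpi_warn_on_fork MCA parameter. One way to set it from the Julia side, assuming OpenMPI's usual OMPI_MCA_* environment-variable convention (which mpiexec forwards to the ranks), is:

# This only silences OpenMPI's fork() warning; it does not address the
# get_position_shared failure itself. OpenMPI reads MCA parameters from
# OMPI_MCA_* environment variables, so this should have the same effect
# as passing "--mca mpi_warn_on_fork 0" to mpiexec.
ENV["OMPI_MCA_mpi_warn_on_fork"] = "0"

using Pkg
Pkg.test("MPI")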

I think there may be an error in those tests: would you mind opening an issue?