Hi all, I’m testing my MPI.jl build, which has been built with the cluster’s MPI library (OpenMPI 3.1.6 compiled with GCC 8.3.0). The tests (]test MPI) succeed except for test_io_shared. The error message is below. I am just running on one node right now. Any thoughts?
Test Failed at /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/test_io_shared.jl:41
Expression: MPI.File.get_position_shared(fh) == 0
Evaluated: 1 == 0
ERROR: LoadError: There was an error during testing
in expression starting at /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/test_io_shared.jl:41
Test Failed at /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/test_io_shared.jl:41
Expression: MPI.File.get_position_shared(fh) == 0
Evaluated: 1 == 0
ERROR: LoadError: There was an error during testing
in expression starting at /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/test_io_shared.jl:41
Test Failed at /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/test_io_shared.jl:41
Expression: MPI.File.get_position_shared(fh) == 0
Evaluated: 1 == 0
ERROR: LoadError: There was an error during testing
in expression starting at /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/test_io_shared.jl:41
Test Failed at /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/test_io_shared.jl:41
Expression: MPI.File.get_position_shared(fh) == 0
Evaluated: 1 == 0
ERROR: LoadError: There was an error during testing
in expression starting at /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/test_io_shared.jl:41
[1646673215.993979] [atl1-1-02-005-33:33246:0] mm_xpmem.c:91 UCX WARN remote segment id 2000081dd apid 4000081de is not released, refcount 1
[1646673215.994014] [atl1-1-02-005-33:33246:0] mm_xpmem.c:91 UCX WARN remote segment id 2000081dc apid 3000081de is not released, refcount 1
[1646673215.994023] [atl1-1-02-005-33:33246:0] mm_xpmem.c:91 UCX WARN remote segment id 2000081df apid 2000081de is not released, refcount 1
[1646673215.994028] [atl1-1-02-005-33:33246:0] mm_xpmem.c:91 UCX WARN remote segment id 2000081de apid 1000081de is not released, refcount 1
[1646673215.993981] [atl1-1-02-005-33:33247:0] mm_xpmem.c:91 UCX WARN remote segment id 2000081dd apid 4000081df is not released, refcount 1
[1646673215.994024] [atl1-1-02-005-33:33247:0] mm_xpmem.c:91 UCX WARN remote segment id 2000081dc apid 2000081df is not released, refcount 1
[1646673215.994031] [atl1-1-02-005-33:33247:0] mm_xpmem.c:91 UCX WARN remote segment id 2000081df apid 1000081df is not released, refcount 1
[1646673215.994035] [atl1-1-02-005-33:33247:0] mm_xpmem.c:91 UCX WARN remote segment id 2000081de apid 3000081df is not released, refcount 1
[1646673215.993975] [atl1-1-02-005-33:33244:0] mm_xpmem.c:91 UCX WARN remote segment id 2000081dd apid 2000081dc is not released, refcount 1
[1646673215.994012] [atl1-1-02-005-33:33244:0] mm_xpmem.c:91 UCX WARN remote segment id 2000081dc apid 1000081dc is not released, refcount 1
[1646673215.994023] [atl1-1-02-005-33:33244:0] mm_xpmem.c:91 UCX WARN remote segment id 2000081df apid 4000081dc is not released, refcount 1
[1646673215.994028] [atl1-1-02-005-33:33244:0] mm_xpmem.c:91 UCX WARN remote segment id 2000081de apid 3000081dc is not released, refcount 1
[1646673215.993981] [atl1-1-02-005-33:33245:0] mm_xpmem.c:91 UCX WARN remote segment id 2000081dd apid 1000081dd is not released, refcount 1
[1646673215.994018] [atl1-1-02-005-33:33245:0] mm_xpmem.c:91 UCX WARN remote segment id 2000081dc apid 4000081dd is not released, refcount 1
[1646673215.994024] [atl1-1-02-005-33:33245:0] mm_xpmem.c:91 UCX WARN remote segment id 2000081df apid 3000081dd is not released, refcount 1
[1646673215.994030] [atl1-1-02-005-33:33245:0] mm_xpmem.c:91 UCX WARN remote segment id 2000081de apid 2000081dd is not released, refcount 1
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15804,1],2]
Exit code: 1
--------------------------------------------------------------------------
test_io_shared.jl: Error During Test at /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/runtests.jl:26
Got exception outside of a @test
failed process: Process(`mpiexec -n 4 /storage/coda-apps/test/manual/packages/julia/1.7.2/gcc-8.3.0/bin/julia -Cnative -J/storage/coda-apps/test/manual/packages/julia/1.7.2/gcc-8.3.0/lib/julia/sys.so --depwarn=yes --check-bounds=yes -g1 --color=yes --startup-file=no /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/test_io_shared.jl`, ProcessExited(1)) [1]
Stacktrace:
[1] pipeline_error
@ ./process.jl:531 [inlined]
[2] run(::Cmd; wait::Bool)
@ Base ./process.jl:446
[3] run
@ ./process.jl:444 [inlined]
[4] (::var"#13#15"{String})(cmd::Cmd)
@ Main /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/runtests.jl:38
[5] (::MPI.var"#28#29"{var"#13#15"{String}})(cmd::Cmd)
@ MPI /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/src/environment.jl:25
[6] _mpiexec
@ /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/deps/deps.jl:6 [inlined]
[7] mpiexec(fn::var"#13#15"{String})
@ MPI /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/src/environment.jl:25
[8] macro expansion
@ /usr/local/pace-apps/manual/packages/julia/1.7.2/gcc-8.3.0/pace/packages/MPI/08SPr/test/runtests.jl:27 [inlined]
[9] top-level scope
@ /storage/coda-apps/test/manual/packages/julia/1.7.2/gcc-8.3.0/share/julia/stdlib/v1.7/Test/src/Test.jl:1359
[10] include(fname::String)
@ Base.MainInclude ./client.jl:451
[11] top-level scope
@ none:6
[12] eval
@ ./boot.jl:373 [inlined]
[13] exec_options(opts::Base.JLOptions)
@ Base ./client.jl:268
[14] _start()
@ Base ./client.jl:495
Test Summary:     | Error  Total
test_io_shared.jl |     1      1
Ron Rahaman
Research Scientist II
Partnership for an Advanced Computing Environment (PACE)
Georgia Institute of Technology