I am trying to build MPI on a cluster. I asked for instructions here but it was suggested that I go to Discourse.
I have specified the variables that I need to find libmpi, but when I try building I get an error. To diagnose the problem I ran the build script in the REPL and reduced it to this problem.
find_library takes two arguments: the first is the library that I want and the second is the path where it should look. Unfortunately, I end up getting "", meaning the library isn’t found.
I then tried Libdl.dlopen to see if I can open the library, and it complains that libevent-2.1.so.6 cannot be opened. I agree that it is not in the path that I specify.
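For reference, here is a minimal sketch of the call described above. The `locate` wrapper and the fallback directory `/path/to/openmpi` are placeholders for illustration, not part of MPI.jl's build script. Note that `Libdl.find_library` reports success only when it can actually load a candidate, so an empty string can also mean "the file is there but failed to load".

```julia
using Libdl

# Hypothetical wrapper: returns `nothing` when find_library comes back empty.
function locate(names::Vector{String}, dirs::Vector{String})
    found = Libdl.find_library(names, dirs)  # "" if no candidate could be loaded
    return isempty(found) ? nothing : found
end

# Placeholder inputs: adjust the directory to wherever libmpi lives on your system.
mpi_dir = joinpath(get(ENV, "EBROOTOPENMPI", "/path/to/openmpi"), "lib")
result = locate(["libmpi"], [mpi_dir])
println(result === nothing ? "libmpi could not be loaded" : "loadable: $result")
```

If this prints "could not be loaded" even though the file exists, calling `Libdl.dlopen` on the full path usually shows the underlying reason.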
I guess I need help making sure that I can find libmpi and libevent since it seems to be dependent on it. Any advice on how I can resolve this problem?
That is a good function to know and I will not forget it. The output is below. I have been using EBROOTOPENMPI as a base for a lot of the paths that I have specified, and looking at this I now know why.
$ module show openmpi/4.0.3
----------------------------------------------------------------------------------------------------------------------------------
/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/CUDA/intel2020/cuda11.0/openmpi/4.0.3.lua:
----------------------------------------------------------------------------------------------------------------------------------
help([[
Description
===========
The Open MPI Project is an open source MPI-3 implementation.
More information
================
- Homepage: https://www.open-mpi.org/
]])
whatis("Description: The Open MPI Project is an open source MPI-3 implementation.")
whatis("Homepage: https://www.open-mpi.org/")
whatis("URL: https://www.open-mpi.org/")
conflict("openmpi")
depends_on("ucx/1.8.0")
depends_on("libfabric/1.11.0")
depends_on("libfabric/1.10.1")
prepend_path("MODULEPATH","/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/MPI/intel2020/cuda11.0/openmpi4")
prepend_path("CMAKE_PREFIX_PATH","/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/CUDA/intel2020/cuda11.0/openmpi/4.0.3")
prepend_path("CPATH","/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/CUDA/intel2020/cuda11.0/openmpi/4.0.3/include")
prepend_path("LIBRARY_PATH","/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/CUDA/intel2020/cuda11.0/openmpi/4.0.3/lib")
prepend_path("MANPATH","/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/CUDA/intel2020/cuda11.0/openmpi/4.0.3/share/man")
prepend_path("PATH","/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/CUDA/intel2020/cuda11.0/openmpi/4.0.3/bin")
prepend_path("PKG_CONFIG_PATH","/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/CUDA/intel2020/cuda11.0/openmpi/4.0.3/lib/pkgconfig")
setenv("EBROOTOPENMPI","/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/CUDA/intel2020/cuda11.0/openmpi/4.0.3")
setenv("EBVERSIONOPENMPI","4.0.3")
setenv("EBDEVELOPENMPI","/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/CUDA/intel2020/cuda11.0/openmpi/4.0.3/easybuild/avx2-CUDA-intel2020-cuda11.0-openmpi-4.0.3-easybuild-devel")
setenv("OMPI_MCA_plm_slurm_args","--whole")
setenv("SLURM_MPI_TYPE","pmix_v3")
setenv("RSNT_SLURM_MPI_TYPE","pmix_v3")
setenv("OMPI_MCA_btl","^openib")
setenv("OMPI_MCA_pml","^ucx,yalla")
setenv("OMPI_MCA_coll","^fca,hcoll")
setenv("OMPI_MCA_osc","^ucx")
setenv("PSM2_CUDA","1")
add_property("type_","mpi")
family("mpi")
Thank you for the suggestion. This is similar to what I have been trying but I still get an error with libmpi not found.
ERROR: LoadError: libmpi could not be found
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] top-level scope
@ ~/.julia/packages/MPI/E3Wer/deps/build.jl:76
[3] include(fname::String)
@ Base.MainInclude ./client.jl:444
[4] top-level scope
@ none:5
in expression starting at /home/fpoulin/.julia/packages/MPI/E3Wer/deps/build.jl:64
ERROR: Error building `MPI`:
It points to line 64, which is where find_library tries to find a library in a path. Even though the library seems to be there, it returns an empty string. I read that this might be because the library has dependencies that fail to load, but I don’t know how to test that.
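One way to test this is to separate "the file exists" from "the file can be loaded". The sketch below does that with a hypothetical helper, `probe_library`, which is not part of Libdl or MPI.jl. It assumes the library file is named `<name>.<Libdl.dlext>` (for example `libmpi.so` on Linux).

```julia
using Libdl

# Hypothetical helper: for each directory, report whether the library file
# exists and whether dlopen can actually load it, keeping the error text.
function probe_library(name::AbstractString, dirs)
    results = NamedTuple[]
    for dir in dirs
        path = joinpath(dir, string(name, ".", Libdl.dlext))
        exists = isfile(path)
        loadable = false
        message = ""
        if exists
            try
                handle = Libdl.dlopen(path)
                Libdl.dlclose(handle)
                loadable = true
            catch err
                # The message usually names the dependency that failed to load
                message = sprint(showerror, err)
            end
        end
        push!(results, (path = path, exists = exists, loadable = loadable, message = message))
    end
    return results
end
```

An entry with `exists == true` and `loadable == false` is the case where find_library returns an empty string even though the file is present, and `message` should say which dependency is the problem.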
Ah, interesting, it is setting LIBRARY_PATH, but not LD_LIBRARY_PATH. LIBRARY_PATH is used by the compiler to find libraries at link time, but not by the dynamic linker (or Julia) at run time, which uses LD_LIBRARY_PATH.
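To see the difference from inside Julia, something like the following sketch lists which directories on each variable contain a libmpi file. The helper names `search_dirs` and `dirs_containing` are made up for this example, and it assumes a Linux-style colon-separated path. As far as I know the dynamic linker reads LD_LIBRARY_PATH when the process starts, so it generally has to be exported in the shell before launching Julia rather than set through `ENV` afterwards.

```julia
# Split a colon-separated search path variable into its directories.
search_dirs(var::AbstractString) = filter(!isempty, split(get(ENV, var, ""), ':'))

# Hypothetical helper: directories listed in `var` that contain a file
# whose name starts with `prefix`.
function dirs_containing(var, prefix)
    return [d for d in search_dirs(var) if isdir(d) && any(startswith(prefix), readdir(d))]
end

for var in ("LIBRARY_PATH", "LD_LIBRARY_PATH")
    println(var, " => ", dirs_containing(var, "libmpi"))
end
```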
Thanks @simonbyrne for the suggestion. I added that line and it didn’t make any difference in the error.
I think that LD_LIBRARY_PATH is probably set correctly but the problem is that the build.jl doesn’t seem to know any of that. library and path are still what I set in the following lines:
Now the path is full of good stuff. It is not optimal, but I thought it was worth a try, and find_library still fails to confirm that libmpi is in the path.
Sorry I cannot contribute here - and I really should be able to.
Just commenting that it looks like you are using EESSI, which uses CernVM-FS and EasyBuild. Fantastic!
Certainly. Copied below. I guess the good news is that it seems to have libmpi in the path in that I get exactly the same output as when I type out the exact location of libmpi.so
julia> dlopen("libmpi")
ERROR: could not load library "libmpi"
libevent-2.1.so.6: cannot open shared object file: No such file or directory
Stacktrace:
[1] dlopen(s::String, flags::UInt32; throw_error::Bool)
@ Base.Libc.Libdl ./libdl.jl:114
[2] dlopen (repeats 2 times)
@ ./libdl.jl:114 [inlined]
[3] top-level scope
@ REPL[2]:1
When I try dlopen("libevent-2.1") I get a message saying it cannot open the shared object file. If I instead type in the whole path I get something a bit more interesting.
julia> dlopen("/cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64/libevent-2.1.so.6")
ERROR: could not load library "/cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64/libevent-2.1.so.6"
libcrypto.so.1.1: cannot open shared object file: No such file or directory
When I try opening the libcrypto library I then get a problem with libc.
julia> dlopen("/cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64/libcrypto.so.1.1")
ERROR: could not load library "/cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64/libcrypto.so.1.1"
/lib64/libc.so.6: version `GLIBC_2.25' not found (required by /cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64/libcrypto.so.1.1)
Finally when I try opening libc then I get a segmentation fault. I wonder if this might be the source of the problem?
julia> dlopen("/lib64/libc.so.6")
signal (11): Segmentation fault
in expression starting at REPL[8]:1
_dl_relocate_object at /lib64/ld-linux-x86-64.so.2 (unknown line)
dl_open_worker at /lib64/ld-linux-x86-64.so.2 (unknown line)
_dl_catch_error at /lib64/ld-linux-x86-64.so.2 (unknown line)
_dl_open at /lib64/ld-linux-x86-64.so.2 (unknown line)
dlopen_doit at /lib64/libdl.so.2 (unknown line)
_dl_catch_error at /lib64/ld-linux-x86-64.so.2 (unknown line)
_dlerror_run at /lib64/libdl.so.2 (unknown line)
dlopen at /lib64/libdl.so.2 (unknown line)
jl_load_dynamic_library at /buildworker/worker/package_linux64/build/src/dlload.c:257
#dlopen#3 at ./libdl.jl:114
dlopen at ./libdl.jl:114 [inlined]
dlopen at ./libdl.jl:114
jfptr_dlopen_52107.clone_1 at /home/fpoulin/software/julia-1.6.1/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:115
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:204
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:155 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:562
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:670
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:877
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:825
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/toplevel.c:929
eval at ./boot.jl:360 [inlined]
eval_user_input at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:139
repl_backend_loop at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:200
start_repl_backend at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:185
#run_repl#42 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:317
run_repl at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:305
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
#874 at ./client.jl:387
jfptr_YY.874_41532.clone_1 at /home/fpoulin/software/julia-1.6.1/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
jl_f__call_latest at /buildworker/worker/package_linux64/build/src/builtins.c:714
#invokelatest#2 at ./essentials.jl:708 [inlined]
invokelatest at ./essentials.jl:706 [inlined]
run_main_repl at ./client.jl:372
exec_options at ./client.jl:302
_start at ./client.jl:485
jfptr__start_34289.clone_1 at /home/fpoulin/software/julia-1.6.1/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
true_main at /buildworker/worker/package_linux64/build/src/jlapi.c:560
repl_entrypoint at /buildworker/worker/package_linux64/build/src/jlapi.c:702
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4007d8)
Allocations: 2651 (Pool: 2640; Big: 11); GC: 0
Segmentation fault
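Rather than following the chain one dlopen at a time (libmpi, then libevent, then libcrypto, then libc), the unresolved dependencies can be listed in one step from the output of `ldd`. This is only a sketch: `missing_dependencies` is a hypothetical helper, it assumes `ldd` is available on the system, and the path in the usage comment is a placeholder.

```julia
# Extract the names of dependencies that `ldd` reports as "not found".
function missing_dependencies(ldd_output::AbstractString)
    missing_libs = String[]
    for line in split(ldd_output, '\n')
        m = match(r"^\s*(\S+)\s*=>\s*not found", line)
        m === nothing || push!(missing_libs, m.captures[1])
    end
    return missing_libs
end

# Usage on a real system (placeholder path):
# output = read(`ldd /path/to/openmpi/lib/libmpi.so`, String)
# println(missing_dependencies(output))
```

This only catches libraries that cannot be located. A version mismatch such as the `GLIBC_2.25' not found message above is a separate problem and would not appear in this list.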
Thanks for sharing this. I did not know where EasyBuild comes from, but it is certainly being used heavily on many big servers as part of Compute Canada.
Actually, there is a built-in version of Julia 1.6 on the server that I was using before. Unfortunately, when I tried to use Plots.jl it failed to produce an mp4 file. This doesn’t happen with the binary version of Julia 1.6, so I presume that’s a bug in the EasyBuild version. Do you think this is something I should mention somewhere and, if yes, where exactly?
EasyBuild exists to make the process of maintaining software on HPC systems easy!
As you have seen, there are several varieties of MPI, compilers, maths libraries, etc. on any HPC system. EasyBuild has the concept of ‘toolchains’, so that applications can be built and maintained with given combinations of the basic tools, for example an Intel compiler version versus a GNU compiler version.
Also on HPC systems you will have software packages which are optimised for the particular CPU architecture you run on, not just the generic builds.