I’m trying to call Fortran code that depends on LAPACK, BLAS, FFTW3, etc., compiled into a “.so” shared library on Linux. I am trying to run it on a SLURM cluster after submitting a job interactively. I can load the library in a single Julia process as follows:
(base) jb6888:~/ $ srun --pty -n 10 zsh
(base) jb6888:spherical_kernel/ $ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.2.0 (2019-08-20)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
julia> using Distributed
julia> using Libdl
julia> @fetchfrom workers()[1] run(`ldd /home/jb6888/.julia/packages/WignerD/Eipeu/src/shtools_wrapper.so`)
linux-vdso.so.1 => (0x00007ffeccdb5000)
libcr_run.so => /lib64/libcr_run.so (0x00002b7640720000)
libfftw3.so.3 => /share/apps/NYUAD/fftw3/avx2/3.3.4/lib/libfftw3.so.3 (0x00002b7640923000)
libgfortran.so.3 => /share/apps/NYUAD//gcc/gcc/4.9.3/el7/lib/gcc/x86_64-unknown-linux-gnu/4.9.3/libgfortran.so.3 (0x00002b7640cbe000)
libm.so.6 => /lib64/libm.so.6 (0x00002b7640fdc000)
liblapack.so => /share/apps/NYUAD/lapack/avx2/3.8.0/lib/liblapack.so (0x00002b76412de000)
libblas.so => /share/apps/NYUAD/blas/avx2/3.6.0/lib/libblas.so (0x00002b7641ff9000)
libgcc_s.so.1 => /share/apps/NYUAD//gcc/gcc/4.9.3/el7/lib/gcc/x86_64-unknown-linux-gnu/lib64/libgcc_s.so.1 (0x00002b7642312000)
libquadmath.so.0 => /share/apps/NYUAD//gcc/gcc/4.9.3/el7/lib/gcc/x86_64-unknown-linux-gnu/4.9.3/libquadmath.so.0 (0x00002b7642529000)
libc.so.6 => /lib64/libc.so.6 (0x00002b7642767000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b7642b28000)
/lib64/ld-linux-x86-64.so.2 (0x00002b76402f8000)
Process(`ldd /home/jb6888/.julia/packages/WignerD/Eipeu/src/shtools_wrapper.so`, ProcessExited(0))
julia> @fetchfrom workers()[1] dlopen("/home/jb6888/.julia/packages/WignerD/Eipeu/src/shtools_wrapper.so")
Ptr{Nothing} @0x0000000001d883c0
So this seems to be working fine. Now, to work on multiple processes, I first export LD_LIBRARY_PATH from the master to the workers and then try to load the library on a worker.
(base) jb6888:spherical_kernel/ $ julia --machine-file=$PBS_NODEFILE
julia> ENV["LD_LIBRARY_PATH"] # It seems to be set on the master but not on the workers
"/home/jb6888/lib:/share/apps/NYUAD/cfitsio/avx2/3.380/lib:/share/apps/NYUAD/fftw3/avx2/3.3.4/lib:/share/apps/NYUAD/blas/avx2/3.6.0/lib:/share/apps/NYUAD/lapack/avx2/3.8.0/lib:/share/apps/NYUAD//gcc/mpc/1.0.3/el7/lib:/share/apps/NYUAD//gcc/mpfr/3.1.3/el7/lib:/share/apps/NYUAD//gcc/gmp/6.0.0/el7/lib:/share/apps/NYUAD//gcc/cloog/0.18.4/el7/lib:/share/apps/NYUAD//gcc/isl/0.14.1/el7/lib:/share/apps/NYUAD//gcc/binutils/2.25/el7/lib:/share/apps/NYUAD//gcc/binutils/2.25/el7/lib64:/share/apps/NYUAD//gcc/autogen/5.18.5/el7/lib:/share/apps/NYUAD//gcc/gcc/4.9.3/el7/lib/gcc/x86_64-unknown-linux-gnu/lib64:/share/apps/NYUAD//gcc/gcc/4.9.3/el7/lib/gcc/x86_64-unknown-linux-gnu/4.9.3:/share/apps/NYUAD//gcc/gcc/4.9.3/el7/lib/gcc/x86_64-unknown-linux-gnu/4.9.3/gcj-4.9.0-15:/home/jb6888/lib:"
julia> @everywhere ENV["LD_LIBRARY_PATH"]=$(ENV["LD_LIBRARY_PATH"])
julia> @fetchfrom workers()[1] ENV["LD_LIBRARY_PATH"] # now it's set on workers
"/home/jb6888/lib:/share/apps/NYUAD/cfitsio/avx2/3.380/lib:/share/apps/NYUAD/fftw3/avx2/3.3.4/lib:/share/apps/NYUAD/blas/avx2/3.6.0/lib:/share/apps/NYUAD/lapack/avx2/3.8.0/lib:/share/apps/NYUAD//gcc/mpc/1.0.3/el7/lib:/share/apps/NYUAD//gcc/mpfr/3.1.3/el7/lib:/share/apps/NYUAD//gcc/gmp/6.0.0/el7/lib:/share/apps/NYUAD//gcc/cloog/0.18.4/el7/lib:/share/apps/NYUAD//gcc/isl/0.14.1/el7/lib:/share/apps/NYUAD//gcc/binutils/2.25/el7/lib:/share/apps/NYUAD//gcc/binutils/2.25/el7/lib64:/share/apps/NYUAD//gcc/autogen/5.18.5/el7/lib:/share/apps/NYUAD//gcc/gcc/4.9.3/el7/lib/gcc/x86_64-unknown-linux-gnu/lib64:/share/apps/NYUAD//gcc/gcc/4.9.3/el7/lib/gcc/x86_64-unknown-linux-gnu/4.9.3:/share/apps/NYUAD//gcc/gcc/4.9.3/el7/lib/gcc/x86_64-unknown-linux-gnu/4.9.3/gcj-4.9.0-15:/home/jb6888/lib:"
This sets LD_LIBRARY_PATH on the worker. I can now check that ldd resolves the dependencies:
julia> @spawnat workers()[1] run(`ldd /home/jb6888/.julia/packages/WignerD/Eipeu/src/shtools_wrapper.so`)
Future(2, 1, 35, nothing)
julia> From worker 2: linux-vdso.so.1 => (0x00007ffc9dbf0000)
From worker 2: libfftw3.so.3 => /share/apps/NYUAD/fftw3/avx2/3.3.4/lib/libfftw3.so.3 (0x00002b73dded9000)
From worker 2: libgfortran.so.3 => /share/apps/NYUAD//gcc/gcc/4.9.3/el7/lib/gcc/x86_64-unknown-linux-gnu/4.9.3/libgfortran.so.3 (0x00002b73de274000)
From worker 2: libm.so.6 => /lib64/libm.so.6 (0x00002b73de592000)
From worker 2: liblapack.so => /share/apps/NYUAD/lapack/avx2/3.8.0/lib/liblapack.so (0x00002b73de894000)
From worker 2: libblas.so => /share/apps/NYUAD/blas/avx2/3.6.0/lib/libblas.so (0x00002b73df5af000)
From worker 2: libgcc_s.so.1 => /share/apps/NYUAD//gcc/gcc/4.9.3/el7/lib/gcc/x86_64-unknown-linux-gnu/lib64/libgcc_s.so.1 (0x00002b73df8c8000)
From worker 2: libquadmath.so.0 => /share/apps/NYUAD//gcc/gcc/4.9.3/el7/lib/gcc/x86_64-unknown-linux-gnu/4.9.3/libquadmath.so.0 (0x00002b73dfadf000)
From worker 2: libc.so.6 => /lib64/libc.so.6 (0x00002b73dfd1d000)
From worker 2: /lib64/ld-linux-x86-64.so.2 (0x00002b73ddab1000)
But I can’t seem to load the library on the worker:
julia> @everywhere using Libdl
julia> @fetchfrom workers()[1] dlopen("/home/jb6888/.julia/packages/WignerD/Eipeu/src/shtools_wrapper.so")
ERROR: On worker 2:
could not load library "/home/jb6888/.julia/packages/WignerD/Eipeu/src/shtools_wrapper.so"
libfftw3.so.3: cannot open shared object file: No such file or directory
#dlopen#3 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Libdl/src/Libdl.jl:109
dlopen at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Libdl/src/Libdl.jl:109 [inlined] (repeats 2 times)
#11 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Distributed/src/macros.jl:130
#112 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Distributed/src/process_messages.jl:292
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Distributed/src/process_messages.jl:79
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Distributed/src/process_messages.jl:292 [inlined]
#111 at ./task.jl:268
Stacktrace:
[1] #remotecall_fetch#149 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Distributed/src/remotecall.jl:379 [inlined]
[2] remotecall_fetch(::Function, ::Distributed.Worker) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Distributed/src/remotecall.jl:371
[3] #remotecall_fetch#152(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(remotecall_fetch), ::Function, ::Int64) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Distributed/src/remotecall.jl:406
[4] remotecall_fetch(::Function, ::Int64) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Distributed/src/remotecall.jl:406
[5] top-level scope at REPL[8]:1
I’m new to Libdl and don’t really understand the details. I have tried opening the library with the flags RTLD_LAZY|RTLD_DEEPBIND|RTLD_GLOBAL, but that didn’t help either:
julia> @fetchfrom workers()[1] dlopen("/home/jb6888/.julia/packages/WignerD/Eipeu/src/shtools_wrapper.so",RTLD_LAZY|RTLD_DEEPBIND|RTLD_GLOBAL)
ERROR: On worker 2:
could not load library "/home/jb6888/.julia/packages/WignerD/Eipeu/src/shtools_wrapper.so"
libfftw3.so.3: cannot open shared object file: No such file or directory
#dlopen#3 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Libdl/src/Libdl.jl:109
dlopen at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Libdl/src/Libdl.jl:109 [inlined]
#13 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Distributed/src/macros.jl:130
#112 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Distributed/src/process_messages.jl:292
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Distributed/src/process_messages.jl:79
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Distributed/src/process_messages.jl:292 [inlined]
#111 at ./task.jl:268
Stacktrace:
[1] #remotecall_fetch#149 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Distributed/src/remotecall.jl:379 [inlined]
[2] remotecall_fetch(::Function, ::Distributed.Worker) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Distributed/src/remotecall.jl:371
[3] #remotecall_fetch#152(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(remotecall_fetch), ::Function, ::Int64) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Distributed/src/remotecall.jl:406
[4] remotecall_fetch(::Function, ::Int64) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/Distributed/src/remotecall.jl:406
[5] top-level scope at REPL[9]:1
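One check that might narrow this down: as far as I understand, glibc's loader reads LD_LIBRARY_PATH once when a process starts, so setting ENV inside an already running worker changes what child processes such as ldd inherit, but not what dlopen searches in the worker itself. The sketch below reads the environment the process was started with from /proc/self/environ (Linux only). The names parse_environ and startup_env are hypothetical helpers, not part of Libdl or Distributed.

```julia
# Parse the NUL-separated KEY=VALUE records of an environ block into a Dict.
function parse_environ(raw::AbstractString)
    env = Dict{String,String}()
    for record in split(raw, '\0'; keepempty=false)
        parts = split(record, '='; limit=2)   # split on the first '=' only
        length(parts) == 2 && (env[parts[1]] = parts[2])
    end
    return env
end

# Environment the current process was launched with (Linux only).
# This is not affected by later assignments to ENV.
startup_env() = parse_environ(read("/proc/self/environ", String))

# Hypothetical usage, after defining both functions with @everywhere:
# @fetchfrom workers()[1] get(startup_env(), "LD_LIBRARY_PATH", "<unset at startup>")
```

If this reports the variable as unset on the worker while ENV["LD_LIBRARY_PATH"] is populated there, the worker was launched without it, which would be consistent with ldd succeeding and dlopen failing.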
Strangely, it works if I use the ClusterManagers package on the login node to submit the job:
(base) jb6888:spherical_kernel/ $ julia
julia> using ClusterManagers,Distributed
julia> addprocs_slurm(10);
Any[:lazy, :topology, :exeflags, :enable_threaded_blas, :exename, :dir]
Dict{Any,Any}()
removing old files
removing old Setting up srun commands
srun: job 1456749 queued and waiting for resources
srun: job 1456749 has been allocated resources
connecting to worker 10 out of 10
julia> @everywhere using Libdl
julia> @fetchfrom workers()[1] dlopen("/home/jb6888/.julia/packages/WignerD/Eipeu/src/shtools_wrapper.so")
Ptr{Nothing} @0x0000000000000000
I’m not sure what I need to do to get this to work when I submit the job myself and use --machine-file. More importantly, why is Libdl unable to locate the dependencies?
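A workaround I could try in the meantime is to open each dependency by absolute path before opening the wrapper, so the loader does not have to search for it. This is only a sketch under two assumptions: that a library already loaded in the process satisfies a later dependency with the same soname, and that the dependency list and its order (taken from the ldd output above, with libraries that others depend on first) are right. find_in_path and preload_and_open are hypothetical helper names.

```julia
using Libdl

# Hypothetical helper: return the first existing file called `name` in the
# directories of a colon-separated search path, or `nothing` if there is none.
function find_in_path(name::AbstractString, searchpath::AbstractString)
    for dir in split(searchpath, ':'; keepempty=false)
        candidate = joinpath(dir, name)
        isfile(candidate) && return candidate
    end
    return nothing
end

# Hypothetical helper: dlopen every dependency by absolute path, then the wrapper.
function preload_and_open(lib::AbstractString, deps;
                          searchpath::AbstractString=get(ENV, "LD_LIBRARY_PATH", ""))
    for dep in deps
        path = find_in_path(dep, searchpath)
        path === nothing && error("could not find $dep in the search path")
        dlopen(path, RTLD_LAZY | RTLD_GLOBAL)
    end
    return dlopen(lib)
end

# Hypothetical usage on a worker, after defining the helpers with @everywhere.
# The order is a guess based on the ldd output:
# deps = ["libquadmath.so.0", "libgcc_s.so.1", "libgfortran.so.3",
#         "libblas.so", "liblapack.so", "libfftw3.so.3"]
# @fetchfrom workers()[1] preload_and_open(
#     "/home/jb6888/.julia/packages/WignerD/Eipeu/src/shtools_wrapper.so", deps)
```

This does not explain why the machine-file workers start without the library path, so it would only be a stopgap.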