I’m attempting to use Julia in a multi-node environment on the Stampede 2 system at TACC. I followed the instructions in "Building with Intel MKL on a KNL system" to get a copy of Julia up and running on the system. I’m now trying to scale my code across multiple nodes by passing the --machinefile command line switch when starting Julia.
I submit the following command to the job scheduler (machine_file contains the list of hosts generated by the scheduler, and test.jl just prints the host name of each worker):
julia --machinefile $HOME/machine_file test.jl
I get the following errors when trying to do so:
Warning: Permanently added 'c455-074' (ECDSA) to the list of known hosts.
/home1/04185/gh8728/julia/usr/bin/julia: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home1/04185/gh8728/julia/usr/bin/../lib/libjulia.so.0.6)
/home1/04185/gh8728/julia/usr/bin/julia: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home1/04185/gh8728/julia/usr/bin/../lib/libjulia.so.0.6)
/home1/04185/gh8728/julia/usr/bin/julia: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home1/04185/gh8728/julia/usr/bin/../lib/libLLVM-4.0.so)
/home1/04185/gh8728/julia/usr/bin/julia: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home1/04185/gh8728/julia/usr/bin/../lib/libLLVM-4.0.so)
ERROR: Unable to read host:port string from worker. Launch command exited with error?
read_worker_host_port(::Pipe) at ./distributed/cluster.jl:236
connect(::Base.Distributed.SSHManager, ::Int64, ::WorkerConfig) at ./distributed/managers.jl:391
create_worker(::Base.Distributed.SSHManager, ::WorkerConfig) at ./distributed/cluster.jl:443
setup_launched_worker(::Base.Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at ./distributed/cluster.jl:389
(::Base.Distributed.##33#36{Base.Distributed.SSHManager,WorkerConfig,Array{Int64,1}})() at ./task.jl:335
Stacktrace:
[1] sync_end() at ./task.jl:287
[2] macro expansion at ./task.jl:303 [inlined]
[3] #addprocs_locked#30(::Array{Any,1}, ::Function, ::Base.Distributed.SSHManager) at ./distributed/cluster.jl:344
[4] (::Base.Distributed.#kw##addprocs_locked)(::Array{Any,1}, ::Base.Distributed.#addprocs_locked, ::Base.Distributed.SSHManager) at ./<missing>:0
[5] #addprocs#29(::Array{Any,1}, ::Function, ::Base.Distributed.SSHManager) at ./distributed/cluster.jl:319
[6] (::Base.Distributed.#kw##addprocs)(::Array{Any,1}, ::Base.Distributed.#addprocs, ::Base.Distributed.SSHManager) at ./<missing>:0
[7] #addprocs#239(::Bool, ::Cmd, ::Int64, ::Array{Any,1}, ::Function, ::Array{Any,1}) at ./distributed/managers.jl:114
[8] process_options(::Base.JLOptions) at ./client.jl:271
[9] _start() at ./client.jl:371
I’m able to run Julia on one node with multiple threads without a problem, so I suspect the problem only appears when workers are launched on other nodes. The errors above show that libjulia and libLLVM require symbol versions (GLIBCXX_3.4.20 and GLIBCXX_3.4.21) that the C++ standard library at /usr/lib64/libstdc++.so.6 does not provide. The output of
strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX
is:
GLIBCXX_3.4
GLIBCXX_3.4.1
GLIBCXX_3.4.2
GLIBCXX_3.4.3
GLIBCXX_3.4.4
GLIBCXX_3.4.5
GLIBCXX_3.4.6
GLIBCXX_3.4.7
GLIBCXX_3.4.8
GLIBCXX_3.4.9
GLIBCXX_3.4.10
GLIBCXX_3.4.11
GLIBCXX_3.4.12
GLIBCXX_3.4.13
GLIBCXX_3.4.14
GLIBCXX_3.4.15
GLIBCXX_3.4.16
GLIBCXX_3.4.17
GLIBCXX_3.4.18
GLIBCXX_3.4.19
GLIBCXX_DEBUG_MESSAGE_LENGTH
The absence of GLIBCXX_3.4.20 and GLIBCXX_3.4.21 from this list further suggests that the system libstdc++ is older than the one my Julia build was linked against.
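The same check can be scripted so that it is easy to repeat on each host. The following is a minimal sketch, not a definitive tool: missing_glibcxx is a hypothetical helper name, and it assumes the output of strings is piped to it on stdin.

```shell
#!/bin/sh
# Hypothetical helper (not part of Julia or binutils): reads the output of
# `strings <libstdc++>` on stdin and prints one line for each requested
# GLIBCXX version that is absent.
missing_glibcxx() {
    available=$(cat)    # capture the symbol-version strings from stdin
    for v in "$@"; do
        # -x matches the whole line, -F treats the dots literally,
        # so GLIBCXX_3.4.2 is not mistaken for GLIBCXX_3.4.20
        if ! printf '%s\n' "$available" | grep -qxF "GLIBCXX_$v"; then
            echo "missing GLIBCXX_$v"
        fi
    done
}

# Intended usage on a node (versions taken from the error messages above):
#   strings /usr/lib64/libstdc++.so.6 | missing_glibcxx 3.4.20 3.4.21
```

An empty result would mean the library at that path provides every version asked for.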
How should I proceed in getting the multi-node parallelism to work? My naive guess would be to get Julia to use an older version of the C++ standard library, but I’m unclear on how to do that.
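One way to narrow this down is to compare which libstdc++ the dynamic loader resolves on the launching node with what it resolves in an ssh session on a worker. The sketch below is only a diagnostic under stated assumptions: resolved_libstdcxx is a hypothetical helper name, the library path is the one printed in the error messages, and c455-074 is the host from the warning above.

```shell
#!/bin/sh
# Hypothetical helper: reads `ldd <binary>` output on stdin and prints the
# path that the dynamic loader resolved for libstdc++.
resolved_libstdcxx() {
    # ldd lines look like: "libstdc++.so.6 => /path/to/libstdc++.so.6 (0x...)"
    awk 'index($1, "libstdc++.so") == 1 && $2 == "=>" { print $3 }'
}

# Intended usage (path and host name are taken from the error output above):
#   ldd /home1/04185/gh8728/julia/usr/lib/libjulia.so.0.6 | resolved_libstdcxx
#   ssh c455-074 'ldd /home1/04185/gh8728/julia/usr/lib/libjulia.so.0.6' | resolved_libstdcxx
```

If the two commands print different paths, the environment seen by the ssh-launched workers differs from the one in which Julia runs correctly on a single node.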
Any help would be appreciated. Thanks in advance!