Multi-Node Parallelism

parallel

#1

I’m attempting to use Julia in a multi-node environment on the Stampede 2 system at TACC. I followed the instructions in "Building with Intel MKL on a KNL system" to get a copy of Julia up and running on the system. I’m now trying to scale my code across multiple nodes using the --machinefile command line switch when starting Julia.

I submit the following command to the job scheduler (where machine_file contains a list of hosts generated by the scheduler and test.jl just prints out the host name on each host):

julia --machinefile $HOME/machine_file test.jl

I get the following errors when trying to do so:

Warning: Permanently added 'c455-074' (ECDSA) to the list of known hosts.
/home1/04185/gh8728/julia/usr/bin/julia: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home1/04185/gh8728/julia/usr/bin/../lib/libjulia.so.0.6)

/home1/04185/gh8728/julia/usr/bin/julia: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home1/04185/gh8728/julia/usr/bin/../lib/libjulia.so.0.6)

/home1/04185/gh8728/julia/usr/bin/julia: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home1/04185/gh8728/julia/usr/bin/../lib/libLLVM-4.0.so)

/home1/04185/gh8728/julia/usr/bin/julia: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home1/04185/gh8728/julia/usr/bin/../lib/libLLVM-4.0.so)

ERROR: Unable to read host:port string from worker. Launch command exited with error?
read_worker_host_port(::Pipe) at ./distributed/cluster.jl:236
connect(::Base.Distributed.SSHManager, ::Int64, ::WorkerConfig) at ./distributed/managers.jl:391
create_worker(::Base.Distributed.SSHManager, ::WorkerConfig) at ./distributed/cluster.jl:443
setup_launched_worker(::Base.Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at ./distributed/cluster.jl:389
(::Base.Distributed.##33#36{Base.Distributed.SSHManager,WorkerConfig,Array{Int64,1}})() at ./task.jl:335
Stacktrace:
 [1] sync_end() at ./task.jl:287
 [2] macro expansion at ./task.jl:303 [inlined]
 [3] #addprocs_locked#30(::Array{Any,1}, ::Function, ::Base.Distributed.SSHManager) at ./distributed/cluster.jl:344
 [4] (::Base.Distributed.#kw##addprocs_locked)(::Array{Any,1}, ::Base.Distributed.#addprocs_locked, ::Base.Distributed.SSHManager) at ./<missing>:0
 [5] #addprocs#29(::Array{Any,1}, ::Function, ::Base.Distributed.SSHManager) at ./distributed/cluster.jl:319
 [6] (::Base.Distributed.#kw##addprocs)(::Array{Any,1}, ::Base.Distributed.#addprocs, ::Base.Distributed.SSHManager) at ./<missing>:0
 [7] #addprocs#239(::Bool, ::Cmd, ::Int64, ::Array{Any,1}, ::Function, ::Array{Any,1}) at ./distributed/managers.jl:114
 [8] process_options(::Base.JLOptions) at ./client.jl:271
 [9] _start() at ./client.jl:371

I’m able to run Julia on one node with multiple threads without a problem, so I suspect the problem is introduced by using multiple nodes. The errors above suggest that Julia needs a newer version of the C++ standard library (libstdc++) than the one found at /usr/lib64 on the system. The output of: strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX is:

GLIBCXX_3.4
GLIBCXX_3.4.1
GLIBCXX_3.4.2
GLIBCXX_3.4.3
GLIBCXX_3.4.4
GLIBCXX_3.4.5
GLIBCXX_3.4.6
GLIBCXX_3.4.7
GLIBCXX_3.4.8
GLIBCXX_3.4.9
GLIBCXX_3.4.10
GLIBCXX_3.4.11
GLIBCXX_3.4.12
GLIBCXX_3.4.13
GLIBCXX_3.4.14
GLIBCXX_3.4.15
GLIBCXX_3.4.16
GLIBCXX_3.4.17
GLIBCXX_3.4.18
GLIBCXX_3.4.19
GLIBCXX_DEBUG_MESSAGE_LENGTH

The absence of GLIBCXX_3.4.20 and GLIBCXX_3.4.21 lines further suggests that the system libstdc++ is older than the version Julia needs.
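
The check above can be scripted so it is easy to repeat on each node. This is only a minimal sketch and not something from the thread: missing_versions is a hypothetical helper name. It reads version tags on standard input and prints whichever of the requested tags are absent.

```shell
# Hypothetical helper (not part of Julia or the system tools).
# Reads version tags on stdin, one per line, and prints each tag
# given as an argument that does not appear in that list.
missing_versions() {
    available=$(cat)
    for tag in "$@"; do
        # -x: match the whole line, -F: treat the tag as a literal string
        printf '%s\n' "$available" | grep -qxF -- "$tag" || printf '%s\n' "$tag"
    done
}

# Example use on a node (assumes the `strings` tool is installed):
#   strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX | missing_versions GLIBCXX_3.4.20 GLIBCXX_3.4.21
```

Whole-line matching matters here, because a plain substring search for GLIBCXX_3.4.2 would also match GLIBCXX_3.4.20.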

How should I proceed to get multi-node parallelism working? My naive guess would be to get Julia to use an older version of the C++ library, but I’m unclear on how to do that.

Any help would be appreciated. Thanks in advance!


#2

I got in contact with the server managers and they insist that the right version of GLIBCXX is installed - just not in the location Julia is looking. The tech provided the following information:


staff.stampede2(36)$ strings /opt/apps/gcc/5.4.0/lib64/libstdc++.so.6 | grep GLIBCXX_3.4.20
GLIBCXX_3.4.20
GLIBCXX_3.4.20

staff.stampede2(1)$ ldd /home1/04185/gh8728/julia/usr/bin/julia
        linux-vdso.so.1 =>  (0x00007ffc1abef000)
        /opt/apps/xalt/1.7/lib64/libxalt_init.so (0x00002af5ccac5000)
        libjulia.so.0.6 => /home1/04185/gh8728/julia/usr/bin/../lib/libjulia.so.0.6 (0x00002af5ccccc000)
        libdl.so.2 => /usr/lib64/libdl.so.2 (0x00002af5cd3cd000)
        librt.so.1 => /usr/lib64/librt.so.1 (0x00002af5cd5d1000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00002af5cd7d9000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00002af5cd9f5000)
        libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00002af5cddb6000)
        libLLVM-4.0.so => /home1/04185/gh8728/julia/usr/bin/../lib/libLLVM-4.0.so (0x00002af5cdfbb000)
        libstdc++.so.6 => /opt/apps/gcc/5.4.0/lib64/libstdc++.so.6 (0x00002af5d001c000)
        libm.so.6 => /usr/lib64/libm.so.6 (0x00002af5d0396000)
        libgcc_s.so.1 => /opt/apps/gcc/5.4.0/lib64/libgcc_s.so.1 (0x00002af5d0698000)
        /lib64/ld-linux-x86-64.so.2 (0x00002af5cc8a2000)
        libz.so.1 => /usr/lib64/libz.so.1 (0x00002af5d08ae000)

It shows the right libstdc++ at

libstdc++.so.6 => /opt/apps/gcc/5.4.0/lib64/libstdc++.so.6 (0x00002af5d001c000)

and

staff.stampede2(40)$ ldd ~/julia/julia
        linux-vdso.so.1 =>  (0x00007fff15ba0000)
        /opt/apps/xalt/1.7/lib64/libxalt_init.so (0x00002ac7b7740000)
        libjulia.so.0.6 => not found
        libdl.so.2 => /usr/lib64/libdl.so.2 (0x00002ac7b7947000)
        librt.so.1 => /usr/lib64/librt.so.1 (0x00002ac7b7b4b000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00002ac7b7d53000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00002ac7b7f6f000)
        libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00002ac7b8330000)
        /lib64/ld-linux-x86-64.so.2 (0x00002ac7b751d000)

Hopefully this additional information sheds some light on the problem.


#3

This kind of thing happens when the version of gcc on the login node differs from the default version of gcc on the compute nodes. HPC environments sometimes have loadable toolchain modules that complicate this further: if different toolchain modules are selected on the login node and on the workers, you get errors like this one (I don’t have access to Stampede, but this is the case on Cori, for example). Sometimes the easiest thing is to rebuild with a slightly older version of gcc. Alternatively, you might want to try prepending /opt/apps/gcc/5.4.0/lib64/ to your LD_LIBRARY_PATH.
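
As a rough sketch of the LD_LIBRARY_PATH suggestion, the snippet below prepends the gcc 5.4.0 library directory mentioned in this thread. prepend_ld_path is a hypothetical helper name, and where to put this (a job script or a shell startup file) depends on the site. It is an assumption that the worker processes started over SSH will pick up the setting, so that needs to be verified on the compute nodes.

```shell
# Hypothetical helper: put a directory at the front of LD_LIBRARY_PATH,
# unless it is already listed somewhere in the variable.
prepend_ld_path() {
    dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;   # already present, leave the variable unchanged
        *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Directory taken from the ldd output earlier in the thread.
prepend_ld_path /opt/apps/gcc/5.4.0/lib64
```

The guard against duplicates keeps the variable from growing if the snippet is sourced more than once.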


#4

I can look into that. As far as I can tell, the right modules are being pushed. When I start an interactive session, I can SSH between the compute nodes without a password, and I can manually start a Julia process on each node after SSHing to it. However, I cannot then add the second node via addprocs() within a Julia session (using the REPL). This leads me to believe that everything is being set up properly, at least for an interactive session.

It is possible that SLURM is not passing along the right stuff to the compute nodes in a non-interactive mode. Following along with http://www.stochasticlifestyle.com/multi-node-parallelism-in-julia-on-an-hpc/ shows that it is possible to get things running with SLURM so I’m skeptical of that being the problem. Further, the Stampede tech seemed to think everything is being passed on correctly.

Is there any reason to think that the SSH cluster manager would be looking for different libraries than the rest of Julia?
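
One way to probe this is to compare which libstdc++ the dynamic loader resolves in an interactive shell with what it resolves for a command run directly over SSH, since a command started over SSH does not necessarily see the same environment (for example, loaded modules) as an interactive session. A minimal sketch, assuming resolved_lib as a hypothetical helper name that parses ldd output:

```shell
# Hypothetical helper: read `ldd` output on stdin and print the path
# that the library named in $1 resolves to, or "not found".
resolved_lib() {
    awk -v lib="$1" '$1 == lib && $2 == "=>" {
        if ($3 == "not") print "not found"; else print $3
    }'
}

# Example comparison (host name and Julia path are the ones from this thread):
#   ldd /home1/04185/gh8728/julia/usr/bin/julia | resolved_lib libstdc++.so.6
#   ssh c455-074 'ldd /home1/04185/gh8728/julia/usr/bin/julia' | resolved_lib libstdc++.so.6
```

If the two commands print different paths, the SSH-launched process is running with a different library search path than the interactive session.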

I can try modifying the path variable and see if that helps. I’m also following up with the Stampede techs to see if they can help pin down why a different library is being found.

Thanks for your suggestions!