Julia crashes when started on the nodes of an HPC cluster

question
hpc
debug
cluster

#1

Hi to everybody,

I’m trying to run Julia on an HPC cluster that we use in our department for astrophysical simulations. The 64-bit system runs CentOS 6.2, and it has a front-end and 8 computing nodes, each with 12 CPUs. The job manager is SLURM.

I just installed Julia 0.6.2 and tried to use it with SLURM, but I found that the julia executable crashes randomly when executed. This happens ~20–30% of the time, which means that if I run tens of instances of Julia using --machinefile, there is always at least one node crashing, and this causes all the other processes to quit. The crash happens with both julia and julia-debug; here is the output when running the latter:

$ julia/bin/julia-debug

signal (11): Segmentation fault
while loading no file, in expression starting on line 0
pthread_create at /lib64/libpthread.so.0 (unknown line)
blas_thread_init at /home/tomasi/julia-d386e40c17/bin/../lib/julia/libopenblas64_.so (unknown line)
gotoblas_init at /home/tomasi/julia-d386e40c17/bin/../lib/julia/libopenblas64_.so (unknown line)
unknown function (ip: 0x3bc3c0e57e)
unknown function (ip: 0x3bc3c12c24)
unknown function (ip: 0x3bc3c0e195)
unknown function (ip: 0x3bc3c12469)
unknown function (ip: 0x3bc4800f65)
unknown function (ip: 0x3bc3c0e195)
unknown function (ip: 0x3bc480129b)
dlopen at /lib64/libdl.so.2 (unknown line)
jl_dlopen at /buildworker/worker/package_linux64/build/src/dlload.c:88
jl_load_dynamic_library_ at /buildworker/worker/package_linux64/build/src/dlload.c:189
jl_load_dynamic_library_e at /buildworker/worker/package_linux64/build/src/dlload.c:214
dlopen_e at ./libdl.jl:110
vendor at ./linalg/blas.jl:64
check at ./linalg/blas.jl:114
__init__ at ./linalg/linalg.jl:284
unknown function (ip: 0x7f3decf37536)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1424
jl_module_run_initializer at /buildworker/worker/package_linux64/build/src/toplevel.c:87
_julia_init at /buildworker/worker/package_linux64/build/src/init.c:733
julia_init at /buildworker/worker/package_linux64/build/src/task.c:301
unknown function (ip: 0x40238f)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x401608)
Allocations: 811914 (Pool: 810886; Big: 1028); GC: 0
Segmentation fault (core dumped)

I am pretty sure that the problem is not Julia, but rather the environment: if I run julia on the front-end instead of the computing nodes, everything works as expected. I am trying to understand where the problem lies, but now I am out of ideas. These are the things I noticed:

  1. There was a problem with /lib64/libz.so (the loader could not find the required version symbol), which I solved by compiling the latest version of the library and loading it with LD_PRELOAD.
  2. The cause of the crash seems to be a few missing functions (the unknown function entries), but their names are not reported in the stack trace: is there a way to find them?
  3. Might this be because of the system libraries Julia is using (/lib64/libpthread.so.0, /lib64/libdl.so.2, and /lib64/libc.so.6)? The files point to libpthread-2.12.so, libdl-2.12.so, and libc-2.12.so. Perhaps version 2.12 is too old? But the front-end has the same file names in /lib64 (i.e., with -2.12 at the end), and segmentation faults do not occur when running Julia on it.
  4. I do not understand the stack trace: where did the segmentation fault happen? It just says …while loading no file, in expression starting on line 0. Did the segmentation fault happen while running the call in the first line of the output (pthread_create at /lib64/libpthread.so.0) or in the last one (__libc_start_main at /lib64/libc.so.6)?

I would really appreciate a few hints about where to look next.

Thanks a lot,
Maurizio.


#2

The stack trace suggests a problem with libpthread. The easiest thing to do (if it is possible on your system) is to request an interactive session and compile Julia on a compute node. This makes sure the build system finds the versions of the various libraries that will be available at runtime.
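
For instance (the partition name, clone path, and job size are just placeholders, adapt them to your site), an interactive build session could look like this:

frontend$ srun -I --pty -N 1 -n 1 -p regular /bin/bash
node1$ git clone https://github.com/JuliaLang/julia.git
node1$ cd julia
node1$ git checkout v0.6.2
node1$ make -j 12

Because the build runs on the node itself, the library detection happens against the node's own /lib64 rather than the front-end's.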

I have an older version of julia built on a cluster with libpthread-2.12, so I don’t think the version is too old.


#3

Thanks a lot, Jared, following your suggestion I was able to achieve some progress here. Here is what I did:

  1. Since my cluster has a very old GCC (4.4; it runs CentOS 6), I downloaded GCC 7.2.0 and compiled it from scratch.
  2. I was then able to compile Julia on the front-end, and I copied all the binaries to the nodes.
  3. The julia executable no longer crashes: it runs fine both on the front-end and on each of the nodes.

I installed GCC 7.2.0 in /opt/gcc/7.2.0/ and created a module file to automatically set environment variables like CXX_INCLUDE_PATH and LD_LIBRARY_PATH. This works without problems when I load the module and then run srun -I --pty (an interactive job), as environment variables are propagated to the node where I land.

However, there is a problem when using --machinefile. From what I understand, Julia uses SSH to spawn other copies of the executable on the nodes. When a copy is spawned on a node different from the one where I landed with srun -I --pty, it fails to find the right libstdc++.so.6 and prints many errors similar to the following:

julia: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/tomasi/julia/usr/bin/../lib/libjulia.so.0.6)

Of course, the file /usr/lib64/libstdc++.so.6 is part of the system’s GCC. I can reproduce the error with the following commands:

frontend$ srun -I --pty -N 1 -n 1 -p regular /bin/bash
node1$ echo $LD_LIBRARY_PATH
/opt/gcc/7.2.0/lib64:/opt/gcc/7.2.0/lib
node1$ # If I run "julia" now, everything is fine
node1$ ssh node2
node2$ echo $LD_LIBRARY_PATH

node2$ ./julia/usr/bin/julia
julia/usr/bin/julia: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/tomasi/julia/usr/bin/../lib/libjulia.so.0.6)
julia/usr/bin/julia: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by /home/tomasi/julia/usr/bin/../lib/libjulia.so.0.6)
julia/usr/bin/julia: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/tomasi/julia/usr/bin/../lib/libjulia.so.0.6)
julia/usr/bin/julia: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by /home/tomasi/julia/usr/bin/../lib/libjulia.so.0.6)
julia/usr/bin/julia: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by /home/tomasi/julia/usr/bin/../lib/libLLVM-3.9.so)
(etc.)
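
One way to double-check the mismatch (paths as in my setup above) is to list the GLIBCXX version symbols each library exports:

node2$ strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX
node2$ strings /opt/gcc/7.2.0/lib64/libstdc++.so.6 | grep GLIBCXX

On my nodes the first list stops at GLIBCXX_3.4.13 (the system GCC 4.4), while the second includes the newer symbols that libjulia.so requires.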

From what I understand, you are running Julia on a cluster and have compiled the executable yourself. How did you set up the compiler’s environment variables? I thought I could copy libstdc++.so.6 into the same directory as the julia executable, but I feel this would be rather hackish.

Many thanks, and a happy 2018!
Maurizio.


#4

Thanks for posting the shell snippet, that is very helpful. I think what is happening is that the ssh node2 command launches a new shell on the remote machine (which therefore does not inherit environment variables from the old one, as you observed from the empty LD_LIBRARY_PATH). In this case you would have to run module load gcc/7.2.0 on node2 and then run julia.

I believe srun will copy environment variables to the compute nodes (that’s why the shell on node1 has LD_LIBRARY_PATH set correctly). So if you do srun -N 2 -n 2 -p regular julia -e 'println("hello world")' it should work correctly.


#5

Yes, you’re right! It works correctly:

$ srun -N 2 -n 2 -p regular julia/usr/bin/julia -e 'println("Hello, world")'
Hello, world
Hello, world

However, I do not understand how parallel code is supposed to work. Consider this output:

$ srun -N 2 -n 2 -p regular julia/usr/bin/julia -e 'println(nprocs())'
1
1

I would have expected this output, instead:

$ srun -N 2 -n 2 -p regular julia/usr/bin/julia -e 'println(nprocs())'
2

If I understand correctly, the command srun is spawning two independent instances of julia: this is why I am getting two lines saying 1 instead of one line saying 2; is that correct? Is there a way to make the two julia instances work together when using Julia’s parallel constructs like @everywhere?

Thanks a lot for your help, Jared, I’m amazed you had time to answer me on January 1st!
Maurizio.


#6

“this is the reason why I am getting two lines saying 1 instead of one line saying 2”

Yes, exactly.

“Is there a way to make the two julia instances work together when using Julia’s parallel constructs like @everywhere?”

I use MPI for everything these days, so I don’t have any recent experience with this.

There is an MPIManager construct in the MPI.jl package that attempts to connect Julia’s parallel constructs with MPI. I haven’t used it, so I don’t know how well it works.

Something I tried a few years ago was to allocate a job with sbatch and use Julia’s --machinefile option to launch julia processes on that allocation. From my recollection, the script looked something like:

# slurm_job.sh
nodes=16
tasks=256

srun -N $nodes -n $tasks hostname > hosts.$SLURM_JOB_ID
julia --machinefile hosts.$SLURM_JOB_ID ./main.jl

and it would be launched with sbatch -N 16 -n 256 -p regular slurm_job.sh.

One thing I observed at the time was that there was a lot of overhead in parallel communication. I haven’t tested more recent versions of Julia, so I don’t know if that has improved (this is what motivated me to use MPI directly).


#7

Hi,

First of all, thanks to Jared for his help. The reason why I was interested in setting up Julia was specifically to compare a traditional MPI-based approach with the parallel constructs provided by the language. (You mentioned you already compared them and found that MPI was more efficient, but if I understand correctly, you did this test with a quite old version of Julia, didn’t you?)

After a lot of struggle, I managed to get everything working. I am posting my solution here, in the hope that it will be useful to somebody else.

I created a file .ssh/environment in each node, where I manually set LD_LIBRARY_PATH:

LD_LIBRARY_PATH=/opt/gcc/7.2.0/lib64:$LD_LIBRARY_PATH

The file .ssh/environment is supposed to be read by ssh whenever a new connection to that computer is created. Unfortunately, the default settings for sshd prevent loading this file (PermitUserEnvironment is set to no), so I had to change the sshd configuration file (CentOS 6 keeps it in /etc/ssh/sshd_config) on every node and then manually reload sshd with kill -HUP $(pidof sshd). After having done this, I can log in to one node and run julia --machinefile FILENAME, and constructs like @parallel work without problems.
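
For reference, the change to the sshd configuration is a single line:

# /etc/ssh/sshd_config, on every node
PermitUserEnvironment yes

After editing it, sshd must be told to re-read its configuration (the kill -HUP mentioned above, or the system’s service reload mechanism).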

This solution is quite hackish (specifically, it forces LD_LIBRARY_PATH to use GCC 7.2.0 libraries even if the user has not loaded the module with module load gcc/7.2.0), but I cannot see a simpler solution here.
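
As a quick sanity check (hosts.txt is a placeholder for the machine file), something like this should print one line per process, each reporting its own host:

node1$ ./julia/usr/bin/julia --machinefile hosts.txt -e '@everywhere println(myid(), " on ", gethostname())'

Note that @everywhere also runs on the master process, so with N workers you get N+1 lines.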

Thanks again to Jared for his help!

Maurizio.


#8

If you are using SLURM, take a look at ClusterManagers.jl, which has integration with SLURM.
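
A minimal sketch (untested on your setup; the worker count and partition name are placeholders):

using ClusterManagers

# Ask SLURM for 24 workers; extra keyword arguments are passed
# through to srun (here: --partition=regular).
addprocs_slurm(24, partition="regular")

# Workers are then usable with the standard parallel constructs:
@everywhere println("worker ", myid(), " on ", gethostname())

This avoids the --machinefile/ssh path entirely, so the environment problems above should not apply.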


#9

Thank you very much for doing this, it is very important to build community knowledge.

That’s correct. I think it was v0.3. I’d be curious to see the results of your experiments if you don’t mind posting them.