Hi everybody,
I’m trying to run Julia on an HPC cluster we are using in our department for astrophysical simulations. The 64-bit system is running CentOS 6.2, and it has a front-end and 8 computing nodes, each with 12 CPUs. The job manager is SLURM.
I just installed Julia 0.6.2 and tried to use it with SLURM, but I found that the julia executable crashes randomly when launched. This happens ~20-30% of the time, which means that if I run tens of instances of Julia using --machinefile, there is always at least one instance that crashes, and this causes all the other processes to quit. The crash happens both with julia and with julia-debug; here is the output when running the latter:
$ julia/bin/julia-debug
signal (11): Segmentation fault
while loading no file, in expression starting on line 0
pthread_create at /lib64/libpthread.so.0 (unknown line)
blas_thread_init at /home/tomasi/julia-d386e40c17/bin/../lib/julia/libopenblas64_.so (unknown line)
gotoblas_init at /home/tomasi/julia-d386e40c17/bin/../lib/julia/libopenblas64_.so (unknown line)
unknown function (ip: 0x3bc3c0e57e)
unknown function (ip: 0x3bc3c12c24)
unknown function (ip: 0x3bc3c0e195)
unknown function (ip: 0x3bc3c12469)
unknown function (ip: 0x3bc4800f65)
unknown function (ip: 0x3bc3c0e195)
unknown function (ip: 0x3bc480129b)
dlopen at /lib64/libdl.so.2 (unknown line)
jl_dlopen at /buildworker/worker/package_linux64/build/src/dlload.c:88
jl_load_dynamic_library_ at /buildworker/worker/package_linux64/build/src/dlload.c:189
jl_load_dynamic_library_e at /buildworker/worker/package_linux64/build/src/dlload.c:214
dlopen_e at ./libdl.jl:110
vendor at ./linalg/blas.jl:64
check at ./linalg/blas.jl:114
__init__ at ./linalg/linalg.jl:284
unknown function (ip: 0x7f3decf37536)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1424
jl_module_run_initializer at /buildworker/worker/package_linux64/build/src/toplevel.c:87
_julia_init at /buildworker/worker/package_linux64/build/src/init.c:733
julia_init at /buildworker/worker/package_linux64/build/src/task.c:301
unknown function (ip: 0x40238f)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x401608)
Allocations: 811914 (Pool: 810886; Big: 1028); GC: 0
Segmentation fault (core dumped)
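To put a firmer number on the ~20-30% crash rate, the launch can be repeated in a loop and the exit codes tallied. What follows is only a minimal Python sketch, assuming a Python 3 interpreter is available on the nodes; `count_exit_codes` is a hypothetical helper name, and the `julia/bin/julia -e 1` command in the comments is just an example invocation.

```python
import os
import subprocess
from collections import Counter


def count_exit_codes(cmd, runs=20, extra_env=None):
    """Run `cmd` `runs` times and tally the exit codes.

    On POSIX systems a negative code -N means the process was killed by
    signal N, so a segmentation fault (signal 11) shows up as -11.
    """
    # Start from the current environment and apply optional overrides.
    env = dict(os.environ)
    env.update(extra_env or {})

    codes = Counter()
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        codes[result.returncode] += 1
    return codes


# Example calls (the command and path are assumptions; adjust as needed):
# print(count_exit_codes(["julia/bin/julia", "-e", "1"], runs=50))
# print(count_exit_codes(["julia/bin/julia", "-e", "1"], runs=50,
#                        extra_env={"OPENBLAS_NUM_THREADS": "1"}))
```

Running this inside a SLURM job on a computing node and again on the front-end would give two tallies to compare. Since the trace passes through blas_thread_init, the second example call repeats the test with OPENBLAS_NUM_THREADS=1, on the unverified assumption that the crash is tied to OpenBLAS starting its worker threads.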
I am pretty sure that the problem is not Julia, but rather the environment: if I run julia on the front-end instead of the computing nodes, everything works as expected. I am trying to understand where the problem lies, but I am out of ideas. These are the things I noticed:
- There was a problem with /lib64/libz.so (it wasn’t finding the version number), which I solved by compiling the latest version of the library and using LD_PRELOAD to load it.
- The cause of the crash seems to be a few missing functions (the "unknown function" entries), but their names are not reported in the stack trace: is there a way to find them?
- Might this be because of the system libraries Julia is using (/lib64/libc.so.6)? The files point to libc-2.12.so. Perhaps version 2.12 is too old? But the front-end has the same file names (with -2.12 at the end), and segmentation faults do not occur when running Julia on it.
- I do not understand the stack trace: where did the segmentation fault happen? It just says "…while loading no file, in expression starting on line 0". Did the segmentation fault happen while running the call in the first line of the output (pthread_create at /lib64/libpthread.so.0) or in the last one (__libc_start_main at /lib64/libc.so.6)?
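Since the same binary behaves differently on the front-end and on the computing nodes, one thing that could be compared is the set of resource limits each process actually sees. The sketch below is only an illustration, again assuming Python 3 is available. The choice of limits is an assumption, guided by the pthread_create frame at the top of the trace (thread creation can fail when the process/thread limit is reached or when memory for a new stack is unavailable). Whether any of these limits is actually involved here is not established.

```python
import resource
import socket

# Limits picked as an assumption: those most likely to affect thread creation.
LIMITS = {
    "max processes/threads (RLIMIT_NPROC)": resource.RLIMIT_NPROC,
    "stack size (RLIMIT_STACK)": resource.RLIMIT_STACK,
    "address space (RLIMIT_AS)": resource.RLIMIT_AS,
    "locked memory (RLIMIT_MEMLOCK)": resource.RLIMIT_MEMLOCK,
}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def read_limits():
    """Return {description: (soft, hard)} for the limits listed above."""
    return {
        name: tuple(fmt(v) for v in resource.getrlimit(res))
        for name, res in LIMITS.items()
    }


def report():
    """Build a text report that can be diffed between two hosts."""
    lines = ["host: " + socket.gethostname()]
    for name, (soft, hard) in read_limits().items():
        lines.append(f"{name}: soft={soft} hard={hard}")
    return "\n".join(lines)


print(report())
```

Saving the output once from a login shell on the front-end and once from inside a SLURM job on a computing node gives two small text files that can be diffed. The job environment is the one that matters, because a SLURM job can run with limits that differ from those of an interactive shell.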
I would really like a few hints about where to look now.
Thanks a lot,