On our new server, we get random segmentation faults when running Julia with multithreading.
The same scripts do not crash with a single thread, nor when run multithreaded on other machines.
The problem seems to be independent of the Julia version (tried 1.8.5, 1.10.1, 1.10.2, 1.10.3 and 1.11.0).
The error is difficult to reproduce (every attempt so far at writing a simple script that crashes has failed), but it happens even when using only LinearAlgebra and FileIO (saving with JLD2).
In these cases, the only multithreading comes from the LinearAlgebra package (in an eigendecomposition call), and maybe from the saving?
We have not been able to trigger a crash without saving, but, importantly, the crash does not occur during the save itself. I am currently trying to investigate this point.
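For anyone who wants to try on their own hardware, here is a minimal sketch of the pattern described above: an eigendecomposition followed by a JLD2 save through FileIO, repeated in a loop. It is not a confirmed reproducer (our simple attempts have not crashed so far). The function name `stress`, the matrix size, the iteration count and the file names are all hypothetical choices made for illustration.

```julia
using LinearAlgebra
using FileIO, JLD2   # assumes both packages are installed in the active environment

# Hypothetical helper: repeat "eigendecomposition, then save" several times.
# n, iters and the output file names are illustrative, not from our real scripts.
function stress(; n = 200, iters = 10, dir = mktempdir())
    for i in 1:iters
        A = Symmetric(randn(n, n))        # random real symmetric matrix
        F = eigen(A)                      # LAPACK call, may use several BLAS threads
        path = joinpath(dir, "eig_$i.jld2")
        # FileIO dispatches on the .jld2 extension and hands the Dict to JLD2
        save(path, Dict("values" => F.values, "vectors" => F.vectors))
    end
    return dir                            # directory containing the saved files
end

stress()
```

To mimic our setup, this would be started with `julia -t8`, possibly with several copies running at once and with larger `n` and `iters`.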
The crash probability seems to increase when several jobs are running on the machine at the same time.
Here is some relevant data.
Julia was installed through juliaup.
julia> versioninfo()
Julia Version 1.10.3
Commit 0b4590a5507 (2024-04-30 10:59 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 112 × Intel(R) Xeon(R) w9-3495X
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, sapphirerapids)
Threads: 1 default, 0 interactive, 1 GC (on 112 virtual cores)
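Note that the `versioninfo()` output above reports a single default thread, so it was captured in a session started without `-t`, unlike the crashing jobs. As a small sketch, the thread counts that a job actually sees could be checked from inside the script as follows. Setting the BLAS thread count to 1 is only a diagnostic idea to separate Julia threads from BLAS threads, not something we have verified to change the crash.

```julia
using LinearAlgebra

# Number of Julia threads in the default pool (set by -t / JULIA_NUM_THREADS)
println("Julia threads: ", Threads.nthreads())

# Number of threads used by the BLAS library behind LinearAlgebra
println("BLAS threads:  ", BLAS.get_num_threads())

# Diagnostic assumption: force BLAS to a single thread so that any remaining
# parallelism comes from Julia tasks only, then rerun the failing job.
BLAS.set_num_threads(1)
println("BLAS threads after set_num_threads(1): ", BLAS.get_num_threads())
```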
The server has the following specs:
The OS is Ubuntu, the machine has 1 TB of RAM, and saving is done to an SSD RAID.
The CPU is as follows (output of lscpu):
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 112
On-line CPU(s) list: 0-111
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) w9-3495X
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 56
Here are some crash outputs obtained by running the Julia jobs under gdb.
We use 8 threads.
Julia 1.10.3, started with julia -t8:
Thread 6 "julia" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd77ff640 (LWP 424890)]
julia_multiq_check_empty_75204 () at partr.jl:186
186 partr.jl: No such file or directory.
(gdb) bt full
#0 julia_multiq_check_empty_75204 () at partr.jl:186
No locals.
#1 0x00007fffe18c30c8 in jfptr_multiq_check_empty_75205 () from /home/XXX/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/lib/julia/sys.so
No symbol table info available.
#2 0x00007ffff6e46a0e in _jl_invoke (world=<optimized out>, mfunc=0x7fffe3eaf010 <jl_system_image_data+21842128>, nargs=0, args=0x0, F=0x7fffe3eaf180 <jl_system_image_data+21842496>)
at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/gf.c:2895
last_alloc = <optimized out>
invoke = <optimized out>
codeinst = <optimized out>
last_errno = <optimized out>
res = <optimized out>
codeinst = <optimized out>
last_alloc = <optimized out>
last_errno = <optimized out>
invoke = <optimized out>
res = <optimized out>
__atomic_load_ptr = <optimized out>
__atomic_load_tmp = <optimized out>
invoke = <optimized out>
__atomic_load_ptr = <optimized out>
__atomic_load_tmp = <optimized out>
res = <optimized out>
__atomic_load_ptr = <optimized out>
__atomic_load_tmp = <optimized out>
__atomic_load_ptr = <optimized out>
__atomic_load_tmp = <optimized out>
#3 ijl_apply_generic (F=F@entry=0x7fffe3eaf180 <jl_system_image_data+21842496>, args=args@entry=0x0, nargs=nargs@entry=0) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/gf.c:3077
world = <optimized out>
mfunc = 0x7fffe3eaf010 <jl_system_image_data+21842128>
#4 0x00007ffff6e97e98 in check_empty (checkempty=0x7fffe3eaf180 <jl_system_image_data+21842496>) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/partr.c:340
No locals.
#5 ijl_task_get_next (trypoptask=<optimized out>, q=<optimized out>, checkempty=<optimized out>) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/partr.c:388
task = 0x0
ptls = <optimized out>
ct = <optimized out>
start_cycles = 527951640215745
#6 0x00007fffe18d3938 in julia_poptask_75373 () at task.jl:985
No locals.
#7 0x00007fffe1759192 in julia_wait_74655 () at task.jl:994
No locals.
#8 0x00007fffe17967fc in julia_task_done_hook_75286 () at task.jl:675
No locals.
#9 0x00007fffe0c82c47 in jfptr_task_done_hook_75287 () from /home/XXX/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/lib/julia/sys.so
No symbol table info available.
#10 0x00007ffff6e46a0e in _jl_invoke (world=<optimized out>, mfunc=0x7fffe5b50a40 <jl_system_image_data+51864320>, nargs=1, args=0x7ffda8dffd98, F=0x7fffe5b50ba0 <jl_system_image_data+51864672>)
at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/gf.c:2895
last_alloc = <optimized out>
invoke = <optimized out>
codeinst = <optimized out>
last_errno = <optimized out>
res = <optimized out>
codeinst = <optimized out>
last_alloc = <optimized out>
last_errno = <optimized out>
invoke = <optimized out>
res = <optimized out>
__atomic_load_ptr = <optimized out>
__atomic_load_tmp = <optimized out>
invoke = <optimized out>
__atomic_load_ptr = <optimized out>
__atomic_load_tmp = <optimized out>
res = <optimized out>
__atomic_load_ptr = <optimized out>
__atomic_load_tmp = <optimized out>
__atomic_load_ptr = <optimized out>
__atomic_load_tmp = <optimized out>
#11 ijl_apply_generic (F=<optimized out>, args=args@entry=0x7ffda8dffd98, nargs=nargs@entry=1) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/gf.c:3077
world = <optimized out>
mfunc = 0x7fffe5b50a40 <jl_system_image_data+51864320>
#12 0x00007ffff6e69c17 in jl_apply (nargs=2, args=0x7ffda8dffd90) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/julia.h:1982
No locals.
#13 jl_finish_task (t=0x7ffdc2bf2720) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/task.c:320
i__tr = 1
i__ca = 1
__excstack_state = <optimized out>
args = {0x7fffe5b50ba0 <jl_system_image_data+51864672>, 0x7ffdc2bf2720}
__eh = {eh_ctx = {{__jmpbuf = {140727870760736, -6395208788728286046, 140727436705184, 0, 140727870760736, 140736481725296, -6395208788701023070, -6396489091794144094}, __mask_was_saved = 0, __saved_mask = {__val = {140727870760792, 140727436705456, 140737335552526, 0, 140737204023312, 140728141919168, 0, 3440, 140727870760848, 0, 140727778357872, 0, 140737204023312, 7271379635568, 1056561956582, 0}}}}, gcstack = 0x0, prev = 0x0, gc_state = 0 '\000', locks_len = 0, defer_signal = 1, timing_stack = 0x7ffda8dffee0, world_age = 31577}
ct = <optimized out>
done = <optimized out>
#14 0x00007ffff6e69d9e in start_task () at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/task.c:1249
ct = <optimized out>
ptls = <optimized out>
res = 0x7ffff01fe008
pt = <optimized out>
Julia 1.11.0 (the 1.11.0-beta1 build), started with julia -t8,1:
julia_multiq_check_empty_64235 () at partr.jl:179
179 partr.jl: No such file or directory.
(gdb)
(gdb) bt full
#0 julia_multiq_check_empty_64235 () at partr.jl:179
No locals.
#1 0x00007fffe1f7b794 in jfptr_multiq_check_empty_64236 () from /home/XXX/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/lib/julia/sys.so
No symbol table info available.
#2 0x00007ffff6ca8f92 in check_empty (checkempty=0x7fffe39725c0 <jl_system_image_data+2480640>) at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/scheduler.c:378
No locals.
#3 ijl_task_get_next (trypoptask=<optimized out>, q=<optimized out>, checkempty=<optimized out>) at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/scheduler.c:434
task = 0x0
ptls = <optimized out>
ct = <optimized out>
start_cycles = 274963046078758
#4 0x00007fffe27fc2be in julia_poptask_64416 () at task.jl:998
No locals.
#5 0x00007fffe183e183 in julia_wait_63927 () at task.jl:1007
No locals.
#6 0x00007fffe1efa98d in julia_task_done_hook_64328 () at task.jl:687
No locals.
#7 0x00007fffe20b6d0f in jfptr_task_done_hook_64329 () from /home/XXX/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/lib/julia/sys.so
No symbol table info available.
#8 0x00007ffff6c75365 in jl_apply (nargs=2, args=0x7ffda6bedd90) at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/julia.h:2154
No locals.
#9 jl_finish_task (ct=ct@entry=0x7ffdbe39aef0) at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/task.c:319
i__try = 1
i__catch = 1
__eh_ct = <optimized out>
__excstack_state = <optimized out>
args = {0x7fffe5bed6f0 <jl_system_image_data+38636336>, 0x7ffdbe39aef0}
__eh = {eh_ctx = {{__jmpbuf = {140727794904816, -755499431405087535, 140727400979872, 0, 140727794904816, 140736414616432, -755499431365241647, -754548968466869039}, __mask_was_saved = 0, __saved_mask = {__val = {140737211523080,
305537, 140737211301968, 117176, 140736414616432, 0, 140728046517136, 140727400980192, 140727794904928, 140727794904816, 140736414616432, 140727400980112, 140728009008062, 140727848373776, 140727848373608,
140727848373632}}}}, gcstack = 0x0, prev = 0x0, gc_state = 0 '\000', locks_len = 0, defer_signal = 1, timing_stack = 0x7ffda6bedeb0, world_age = 26593}
done = <optimized out>
#10 0x00007ffff6c7552b in start_task () at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/task.c:1213
ct = <optimized out>
ptls = <optimized out>
res = 0x7fffef7fe008
pt = <optimized out>