Multithreading occasionally throws a segmentation fault

Hi, I’m wondering whether this is a known issue. I have a set of simulations, each wrapped in its own Task, and I run them on 72 threads. When I create 160 tasks with Threads.@spawn, my code throws a segmentation fault from inside BLAS. I tried to find out exactly which task causes it:

  1. I manually split the 160 tasks into two batches of 80, and the segmentation fault no longer occurs…
  2. I then ran with --check-bounds=yes to see whether anything goes out of bounds, but the segmentation fault no longer occurs then either.

The codebase is quite large, so I have not been able to extract a small, readable script that reproduces it. Since the error comes from a BLAS function rather than from Julia itself, I can’t figure out what is causing it.
Any idea what I can use to trace this error further?

The segmentation fault trace looks like this:

signal (11): Segmentation fault
in expression starting at :1
unknown function (ip: 0x7fa766645941)
zgemv_n_SKYLAKEX at /home/ubuntu/packages/julias/julia-1.5/bin/../lib/julia/libopenblas64_.so (unknown line)
zgemv_64_ at /home/ubuntu/packages/julias/julia-1.5/bin/../lib/julia/libopenblas64_.so (unknown line)
gemv! at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/blas.jl:626
gemv! at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:470
mul! at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:66 [inlined]
mul! at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:208 [inlined]
#expv!#27 at /home/ubuntu/.julia/packages/ExponentialUtilities/XXu86/src/krylov_phiv.jl:118

I can always reproduce the segmentation fault by running the entire thing (all 160 tasks on 72 threads) without setting --check-bounds=yes.


For more detail on the configuration: I have called BLAS.set_num_threads(1), and the thread-spawning part looks like this (the actual code is much longer…):

# 160 jobs here, each running as its own Task
# all the jobs are independent of each other
# (in principle they could also run in separate processes)
jobs = generate_jobs(configs)
@sync for i in eachindex(jobs)
    Threads.@spawn begin
        run(jobs[i])
        # some post-processing
    end
end

The BLAS call happens inside each job.
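For reference, here is a stripped-down, self-contained sketch of the same pattern. It is not the actual code: `run_job`, the matrix size, and the `results` vector are hypothetical stand-ins, and this sketch is not known to reproduce the crash. The only parts taken from the setup above are the single-threaded BLAS setting, the 160 spawned tasks, and a complex matrix-vector `mul!`, which should end up in the same `gemv!` path that appears in the trace.

```julia
using LinearAlgebra
using Base.Threads: @spawn

# Keep BLAS single-threaded so that only Julia tasks provide parallelism.
BLAS.set_num_threads(1)

# Hypothetical stand-in for one job: a complex matrix-vector product.
function run_job(n::Int)
    A = rand(ComplexF64, n, n)
    x = rand(ComplexF64, n)
    y = similar(x)            # each task owns its own output buffer
    mul!(y, A, x)             # in-place product, as in the trace
    return sum(abs2, y)
end

jobs = fill(64, 160)          # 160 independent jobs (sizes are made up)
results = Vector{Float64}(undef, length(jobs))

@sync for i in eachindex(jobs)
    @spawn begin
        # Every task writes to a distinct index, so no locking is needed.
        results[i] = run_job(jobs[i])
    end
end
```

Allocating `y` inside each job matters here: if two tasks shared one output buffer, concurrent `mul!` calls would race on it, which is one thing worth ruling out in the real code.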

OK, as suggested on Slack, I set JULIA_COPY_STACKS=1 and now it runs without segmentation faults! I’m not sure why, though.
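For anyone trying the same workaround: as far as I understand, the variable is read when Julia starts, so it has to be in the environment before launch rather than set from inside the session. A small check that can go at the top of the script, assuming the launch command shown in the comment (`script.jl` is a placeholder name):

```julia
# Launched as, for example:
#   JULIA_COPY_STACKS=1 julia -t 72 script.jl
# (script.jl is a placeholder; -t sets the number of Julia threads)

# Report the settings this session actually started with.
copy_stacks = get(ENV, "JULIA_COPY_STACKS", "unset")
println("threads = ", Threads.nthreads(),
        ", JULIA_COPY_STACKS = ", copy_stacks)
```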