I’m getting strange segmentation faults when using SharedArrays and passing --threads to Julia, even when I’m not actually using any of those threads! Here is a minimal example:
    using Distributed

    function main()
        sa = SharedArray{Float64}(1000, 10000)
        fill!(sa, 0)

        println("write"); flush(stdout)
        sa .= 1.0

        println("read"); flush(stdout)
        @everywhere workers() begin
            # dummy calculation which reads from sa a lot
            sa = $sa
            for i in 1:size(sa, 2)
                sum(
                    sum(1.1 .* @view sa[:, i])
                    for _ in 1:2000
                )
            end
        end
        println("DONE"); flush(stdout)
    end

    addprocs(
        14;
        topology=:all_to_all, lazy=false,
        # results in segfault
        exeflags=`--startup-file=no --threads=16`
        # no segfault!
        # exeflags=`--startup-file=no`
    )

    @everywhere begin
        using Distributed
        using SharedArrays
    end

    main()
which gives me output like the following:
    $ julia test.jl
    write
    read
          From worker 9:
          From worker 9:    [12998] signal (11.1): Segmentation fault
          From worker 9:    in expression starting at none:1
          From worker 9:    Allocations: 20032552 (Pool: 18379714; Big: 1652838); GC: 308
    Worker 9 terminated.
    Unhandled Task ERROR: EOFError: read end of file
    Stacktrace:
     [1] (::Base.var"#wait_locked#739")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
       @ Base ./stream.jl:947
     [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
       @ Base ./stream.jl:955
     [3] unsafe_read
       @ ./io.jl:774 [inlined]
     [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
       @ Base ./io.jl:773
     [5] read!
       @ ./io.jl:775 [inlined]
     [6] deserialize_hdr_raw
       @ ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/messages.jl:167 [inlined]
     [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
       @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:172
     [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
       @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:133
     [9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
       @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:121
Using the commented-out line instead (without --threads), I’m not getting any segmentation faults.
Triggering the segfault does seem to depend on the number of worker processes and possibly other factors, but it’s very unclear to me what is going on. Is this a bug or am I doing something wrong?
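As a sanity check on the "not actually using any of those threads" part, here is a minimal sketch that confirms each worker really does start with the thread count given in exeflags. It is separate from the reproducer above; the worker count (2) and thread count (2) are arbitrary values chosen to keep the check small, and the names pids and thread_counts are only for illustration.

```julia
using Distributed

# Assumption: 2 workers with 2 threads each, chosen only to keep the check small.
pids = addprocs(2; exeflags=`--startup-file=no --threads=2`)

# Ask each worker how many threads it was started with.
thread_counts = Dict(pid => remotecall_fetch(Threads.nthreads, pid) for pid in pids)

for (pid, n) in sort(collect(thread_counts))
    println("worker ", pid, ": ", n, " threads")
end

# Shut the workers down again so the check leaves nothing running.
rmprocs(pids)
```

The same check can be run with the exeflags from the reproducer to verify that the workers are multi-threaded even though the workload itself never spawns tasks on those threads.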
Here is my versioninfo(), because it seems like it could be relevant, but I note that I’ve reproduced this on several different machines:
    julia> versioninfo()
    Julia Version 1.10.2
    Commit bd47eca2c8a (2024-03-01 10:14 UTC)
    Build Info:
      Official https://julialang.org/ release
    Platform Info:
      OS: Linux (x86_64-linux-gnu)
      CPU: 16 × AMD Ryzen 7 4800H with Radeon Graphics
      WORD_SIZE: 64
      LIBM: libopenlibm
      LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
    Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)