I have no idea how to debug this problem, so any even vague “that sounds like…” is appreciated.
My program works when distributed within a single node, but fails when given tasks/cpus that span two nodes. I unfortunately haven’t been able to make a MWE. Simple attempts to replicate the problem only show that, ostensibly, both Julia and the slurm cluster work just fine with each other. The program goes as follows:
- start
julia -p P
through a slurm batch job assignedP
tasks. - set up a
RemoteChannel
that can hold a number of results somewhat less than will fit into a single task’s memory - use
@spawnat
to run a function thattake
s from theRemoteChannel
and save the result out to a drive shared across the cluster - use
@distributed
to iterate over large batches of parameters to be solved, at the end of each batch usingput!
to send the results to be saved as described in (3). - Fetch the value returned by
@spawnat
(which just confirms that all batches were evaluated)
The error returned is this:
┌ Warning: Parallelizing parameter sweep.
└ @ TravelingWaveSimulations ~/git/TravelingWaveSimulations/src/script_helpers/based_on_example.jl:58
┌ Warning: max_held_batches = 16
└ @ TravelingWaveSimulations ~/git/TravelingWaveSimulations/src/script_helpers/based_on_example.jl:155
Worker 13 terminated.
Worker 17 terminated.
Worker 9 terminated.
┌ Error: Fatal error on process 1
│ exception =
│ IOError: write: connection reset by peer (ECONNRESET)
│ Stacktrace:
│ [1] uv_write(::Sockets.TCPSocket, ::Ptr{UInt8}, ::UInt64) at ./stream.jl:953
│ [2] unsafe_write(::Sockets.TCPSocket, ::Ptr{UInt8}, ::UInt64) at ./stream.jl:920
│ [3] macro expansion at ./io.jl:593 [inlined]
│ [4] write at ./io.jl:616 [inlined]
│ [5] serialize_array_data at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:246 [inlined]
│ [6] serialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Array{Float64,2}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:263
│ [7] serialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Array{Array{Float64,2},1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:268 (repeats 2 times)
│ [8] serialize_any(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:629
│ [9] serialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:608 [inlined]
│ [10] serialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Tuple{Nothing,NamedTuple{(:See, :stop_time, :Sie, :Sii, :Sei, :u, :t, :x),Tuple{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Array{Array{Float64,2},1},1},Array{Array{Float64,1},1},Array{Array{Tuple{Float64},1},1}}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:196
│ [11] serialize_msg(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Distributed.ResultMsg) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/messages.jl:90
│ [12] #invokelatest#1 at ./essentials.jl:709 [inlined]
│ [13] invokelatest at ./essentials.jl:708 [inlined]
│ [14] send_msg_(::Distributed.Worker, ::Distributed.MsgHeader, ::Distributed.ResultMsg, ::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/messages.jl:185
│ [15] send_msg_now at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/messages.jl:130 [inlined]
│ [16] send_msg_now(::Sockets.TCPSocket, ::Distributed.MsgHeader, ::Distributed.ResultMsg) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/messages.jl:125
│ [17] deliver_result(::Sockets.TCPSocket, ::Symbol, ::Distributed.RRID, ::Tuple{Nothing,NamedTuple{(:See, :stop_time, :Sie, :Sii, :Sei, :u, :t, :x),Tuple{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Array{Array{Float64,2},1},1},Array{Array{Float64,1},1},Array{Array{Tuple{Float64},1},1}}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/process_messages.jl:111
│ [18] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/process_messages.jl:302 [inlined]
│ [19] (::Distributed.var"#107#109"{Distributed.CallMsg{:call_fetch},Distributed.MsgHeader,Sockets.TCPSocket})() at ./task.jl:333
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/process_messages.jl:115
ERROR: LoadError: TaskFailedException:
ProcessExitedException(9)
...and 3 more exception(s).
Stacktrace:
[1] sync_end(::Array{Any,1}) at ./task.jl:300
[2] (::Distributed.var"#161#163"{TravelingWaveSimulations.var"#140#143"{TravelingWaveSimulations.var"###example#418",Array{Any,1},String,Array{Symbol,1},RemoteChannel{Channel{Tuple}}},Array{Array{Dict{Symbol,Any},1},1}})() at ./task.jl:319
Stacktrace:
[1] sync_end(::Array{Any,1}) at ./task.jl:300
[2] macro expansion at ./task.jl:319 [inlined]
[3] execute_modifications_parallel_saving(::Function, ::Array{Dict{Symbol,Any},5}, ::Array{Any,1}, ::String, ::String, ::Int64, ::Int64) at /home/grahams/git/TravelingWaveSimulations/src/script_helpers/based_on_example.jl:161
[4] #based_on_example#133(::String, ::Bool, ::String, ::Array{Any,1}, ::Array{Any,1}, ::Int64, ::Int64, ::typeof(based_on_example)) at /home/grahams/git/TravelingWaveSimulations/src/script_helpers/based_on_example.jl:62
[5] (::TravelingWaveSimulations.var"#kw##based_on_example")(::NamedTuple{(:data_root, :analyses, :batch, :no_save_raw, :max_sims_in_mem, :modifications, :example_name),Tuple{String,Array{Any,1},Int64,Bool,Int64,Array{Any,1},String}}, ::typeof(based_on_example)) at ./none:0
[6] top-level scope at /home/grahams/git/TravelingWaveSimulations/scripts/based_on_example.jl:45
[7] include at ./boot.jl:328 [inlined]
[8] include_relative(::Module, ::String) at ./loading.jl:1105
[9] include(::Module, ::String) at ./Base.jl:31
[10] exec_options(::Base.JLOptions) at ./client.jl:287
[11] _start() at ./client.jl:460
in expression starting at /home/grahams/git/TravelingWaveSimulations/scripts/based_on_example.jl:45
Note that only half the workers terminate (investigated via htop
on an interactive session). I can’t figure out if all the terminated workers are on the same node, but I’m guessing so.
Based on various print statements, the error occurs during step 4 before any data has been saved.