I have no idea how to debug this problem, so even a vague “that sounds like…” is appreciated.
My program works when distributed within a single node, but fails when given tasks/CPUs that span two nodes. Unfortunately, I haven’t been able to make an MWE. Simple attempts to replicate the problem only show that, ostensibly, Julia and the Slurm cluster work just fine with each other. The program goes as follows:
1. Start `julia -p P` through a Slurm batch job assigned `P` tasks.
2. Set up a `RemoteChannel` that can hold a number of results somewhat less than will fit into a single task’s memory.
3. Use `@spawnat` to run a function that takes from the `RemoteChannel` and saves the result out to a drive shared across the cluster.
4. Use `@distributed` to iterate over large batches of parameters to be solved, at the end of each batch using `put!` to send the results to be saved as described in (3).
5. Fetch the value returned by `@spawnat` (which just confirms that all batches were evaluated).

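Since I don't have a real MWE, here is a minimal sketch of the structure described in the steps above. It is not my actual code: `solve_batch`, `save_results`, `run_sweep`, the toy batches, the temporary output directory, and running the saver on the first worker are all placeholder assumptions, and `addprocs(2)` stands in for `julia -p P` under Slurm.

```julia
using Distributed
addprocs(2)  # stand-in for `julia -p P`; local workers only (assumption)

# Hypothetical solver: the real one produces large simulation results.
@everywhere solve_batch(params) = [p^2 for p in params]

# Hypothetical saver: takes one result per batch from the channel and
# writes it to disk, then reports how many batches it saved.
@everywhere function save_results(results, n_batches, out_dir)
    n_saved = 0
    for i in 1:n_batches
        batch = take!(results)  # blocks until a worker has put! a result
        open(joinpath(out_dir, "batch_$i.txt"), "w") do io
            println(io, batch)
        end
        n_saved += 1
    end
    return n_saved
end

function run_sweep(batches, out_dir; max_held_batches = 4)
    # Step 2: bounded channel, so put! blocks once max_held_batches are waiting.
    results = RemoteChannel(() -> Channel{Vector{Float64}}(max_held_batches))
    # Step 3: start the saver (placing it on the first worker is an assumption).
    saver = @spawnat first(workers()) save_results(results, length(batches), out_dir)
    # Step 4: solve batches in parallel and send each result to the saver.
    @sync @distributed for batch in batches
        put!(results, solve_batch(batch))
    end
    # Step 5: confirm that every batch was saved.
    return fetch(saver)
end

batches = [collect(Float64, i:i+2) for i in 1:3:12]  # four toy batches
out_dir = mktempdir()  # stand-in for the shared drive
n_saved = run_sweep(batches, out_dir)
```

A toy version like this runs without complaint for me, which is part of why I can't reproduce the failure; the real results are far larger than these three-element vectors.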
The error returned is this:
┌ Warning: Parallelizing parameter sweep.
└ @ TravelingWaveSimulations ~/git/TravelingWaveSimulations/src/script_helpers/based_on_example.jl:58
┌ Warning: max_held_batches = 16
└ @ TravelingWaveSimulations ~/git/TravelingWaveSimulations/src/script_helpers/based_on_example.jl:155
Worker 13 terminated.
Worker 17 terminated.
Worker 9 terminated.
┌ Error: Fatal error on process 1
│   exception =
│    IOError: write: connection reset by peer (ECONNRESET)
│    Stacktrace:
│     [1] uv_write(::Sockets.TCPSocket, ::Ptr{UInt8}, ::UInt64) at ./stream.jl:953
│     [2] unsafe_write(::Sockets.TCPSocket, ::Ptr{UInt8}, ::UInt64) at ./stream.jl:920
│     [3] macro expansion at ./io.jl:593 [inlined]
│     [4] write at ./io.jl:616 [inlined]
│     [5] serialize_array_data at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:246 [inlined]
│     [6] serialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Array{Float64,2}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:263
│     [7] serialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Array{Array{Float64,2},1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:268 (repeats 2 times)
│     [8] serialize_any(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:629
│     [9] serialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:608 [inlined]
│     [10] serialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Tuple{Nothing,NamedTuple{(:See, :stop_time, :Sie, :Sii, :Sei, :u, :t, :x),Tuple{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Array{Array{Float64,2},1},1},Array{Array{Float64,1},1},Array{Array{Tuple{Float64},1},1}}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:196
│     [11] serialize_msg(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Distributed.ResultMsg) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/messages.jl:90
│     [12] #invokelatest#1 at ./essentials.jl:709 [inlined]
│     [13] invokelatest at ./essentials.jl:708 [inlined]
│     [14] send_msg_(::Distributed.Worker, ::Distributed.MsgHeader, ::Distributed.ResultMsg, ::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/messages.jl:185
│     [15] send_msg_now at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/messages.jl:130 [inlined]
│     [16] send_msg_now(::Sockets.TCPSocket, ::Distributed.MsgHeader, ::Distributed.ResultMsg) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/messages.jl:125
│     [17] deliver_result(::Sockets.TCPSocket, ::Symbol, ::Distributed.RRID, ::Tuple{Nothing,NamedTuple{(:See, :stop_time, :Sie, :Sii, :Sei, :u, :t, :x),Tuple{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Array{Array{Float64,2},1},1},Array{Array{Float64,1},1},Array{Array{Tuple{Float64},1},1}}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/process_messages.jl:111
│     [18] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/process_messages.jl:302 [inlined]
│     [19] (::Distributed.var"#107#109"{Distributed.CallMsg{:call_fetch},Distributed.MsgHeader,Sockets.TCPSocket})() at ./task.jl:333
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/process_messages.jl:115
ERROR: LoadError: TaskFailedException:
ProcessExitedException(9)
...and 3 more exception(s).
Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:300
 [2] (::Distributed.var"#161#163"{TravelingWaveSimulations.var"#140#143"{TravelingWaveSimulations.var"###example#418",Array{Any,1},String,Array{Symbol,1},RemoteChannel{Channel{Tuple}}},Array{Array{Dict{Symbol,Any},1},1}})() at ./task.jl:319
Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:300
 [2] macro expansion at ./task.jl:319 [inlined]
 [3] execute_modifications_parallel_saving(::Function, ::Array{Dict{Symbol,Any},5}, ::Array{Any,1}, ::String, ::String, ::Int64, ::Int64) at /home/grahams/git/TravelingWaveSimulations/src/script_helpers/based_on_example.jl:161
 [4] #based_on_example#133(::String, ::Bool, ::String, ::Array{Any,1}, ::Array{Any,1}, ::Int64, ::Int64, ::typeof(based_on_example)) at /home/grahams/git/TravelingWaveSimulations/src/script_helpers/based_on_example.jl:62
 [5] (::TravelingWaveSimulations.var"#kw##based_on_example")(::NamedTuple{(:data_root, :analyses, :batch, :no_save_raw, :max_sims_in_mem, :modifications, :example_name),Tuple{String,Array{Any,1},Int64,Bool,Int64,Array{Any,1},String}}, ::typeof(based_on_example)) at ./none:0
 [6] top-level scope at /home/grahams/git/TravelingWaveSimulations/scripts/based_on_example.jl:45
 [7] include at ./boot.jl:328 [inlined]
 [8] include_relative(::Module, ::String) at ./loading.jl:1105
 [9] include(::Module, ::String) at ./Base.jl:31
 [10] exec_options(::Base.JLOptions) at ./client.jl:287
 [11] _start() at ./client.jl:460
in expression starting at /home/grahams/git/TravelingWaveSimulations/scripts/based_on_example.jl:45
Note that only half the workers terminate (investigated via htop in an interactive session). I can’t figure out whether all the terminated workers are on the same node, but I’m guessing so.
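One way to check which node each worker lives on might be to ask every worker for its hostname before the sweep starts. The following is only a sketch: `host_of` is a hypothetical name, and `addprocs(2)` again stands in for the Slurm-launched workers, so locally every worker reports the same host.

```julia
using Distributed
addprocs(2)  # stand-in for the Slurm-launched workers (assumption)

# Map each worker id to the hostname of the machine it runs on.
host_of = Dict(p => remotecall_fetch(gethostname, p) for p in workers())

for p in sort(collect(keys(host_of)))
    println("worker ", p, " runs on ", host_of[p])
end
```

Comparing this mapping against the "Worker N terminated." lines would show whether workers 9, 13, and 17 share a node.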
Based on various print statements, the error occurs during step 4 before any data has been saved.