Debugging RemoteChannel memory error during distributed computation

I have no idea how to debug this problem, so any even vague “that sounds like…” is appreciated.

My program works when distributed within a single node, but fails when given tasks/cpus that span two nodes. I unfortunately haven’t been able to make a MWE. Simple attempts to replicate the problem only show that, ostensibly, both Julia and the slurm cluster work just fine with each other. The program goes as follows:

  1. start julia -p P through a slurm batch job assigned P tasks.
  2. set up a RemoteChannel that can hold a number of results somewhat less than will fit into a single task’s memory
  3. use @spawnat to run a function that takes from the RemoteChannel and save the result out to a drive shared across the cluster
  4. use @distributed to iterate over large batches of parameters to be solved, at the end of each batch using put! to send the results to be saved as described in (3).
  5. Fetch the value returned by @spawnat (which just confirms that all batches were evaluated)

The error returned is this:

┌ Warning: Parallelizing parameter sweep.
└ @ TravelingWaveSimulations ~/git/TravelingWaveSimulations/src/script_helpers/based_on_example.jl:58
┌ Warning: max_held_batches = 16
└ @ TravelingWaveSimulations ~/git/TravelingWaveSimulations/src/script_helpers/based_on_example.jl:155
Worker 13 terminated.
Worker 17 terminated.
Worker 9 terminated.
┌ Error: Fatal error on process 1
│   exception =
│    IOError: write: connection reset by peer (ECONNRESET)
│    Stacktrace:
│     [1] uv_write(::Sockets.TCPSocket, ::Ptr{UInt8}, ::UInt64) at ./stream.jl:953
│     [2] unsafe_write(::Sockets.TCPSocket, ::Ptr{UInt8}, ::UInt64) at ./stream.jl:920
│     [3] macro expansion at ./io.jl:593 [inlined]
│     [4] write at ./io.jl:616 [inlined]
│     [5] serialize_array_data at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:246 [inlined]
│     [6] serialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Array{Float64,2}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:263
│     [7] serialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Array{Array{Float64,2},1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:268 (repeats 2 times)
│     [8] serialize_any(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:629
│     [9] serialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:608 [inlined]
│     [10] serialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Tuple{Nothing,NamedTuple{(:See, :stop_time, :Sie, :Sii, :Sei, :u, :t, :x),Tuple{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Array{Array{Float64,2},1},1},Array{Array{Float64,1},1},Array{Array{Tuple{Float64},1},1}}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Serialization/src/Serialization.jl:196
│     [11] serialize_msg(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Distributed.ResultMsg) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/messages.jl:90
│     [12] #invokelatest#1 at ./essentials.jl:709 [inlined]
│     [13] invokelatest at ./essentials.jl:708 [inlined]
│     [14] send_msg_(::Distributed.Worker, ::Distributed.MsgHeader, ::Distributed.ResultMsg, ::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/messages.jl:185
│     [15] send_msg_now at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/messages.jl:130 [inlined]
│     [16] send_msg_now(::Sockets.TCPSocket, ::Distributed.MsgHeader, ::Distributed.ResultMsg) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/messages.jl:125
│     [17] deliver_result(::Sockets.TCPSocket, ::Symbol, ::Distributed.RRID, ::Tuple{Nothing,NamedTuple{(:See, :stop_time, :Sie, :Sii, :Sei, :u, :t, :x),Tuple{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Array{Array{Float64,2},1},1},Array{Array{Float64,1},1},Array{Array{Tuple{Float64},1},1}}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/process_messages.jl:111
│     [18] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/process_messages.jl:302 [inlined]
│     [19] (::Distributed.var"#107#109"{Distributed.CallMsg{:call_fetch},Distributed.MsgHeader,Sockets.TCPSocket})() at ./task.jl:333
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/process_messages.jl:115
ERROR: LoadError: TaskFailedException:
ProcessExitedException(9)

...and 3 more exception(s).

Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:300
 [2] (::Distributed.var"#161#163"{TravelingWaveSimulations.var"#140#143"{TravelingWaveSimulations.var"###example#418",Array{Any,1},String,Array{Symbol,1},RemoteChannel{Channel{Tuple}}},Array{Array{Dict{Symbol,Any},1},1}})() at ./task.jl:319
Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:300
 [2] macro expansion at ./task.jl:319 [inlined]
 [3] execute_modifications_parallel_saving(::Function, ::Array{Dict{Symbol,Any},5}, ::Array{Any,1}, ::String, ::String, ::Int64, ::Int64) at /home/grahams/git/TravelingWaveSimulations/src/script_helpers/based_on_example.jl:161
 [4] #based_on_example#133(::String, ::Bool, ::String, ::Array{Any,1}, ::Array{Any,1}, ::Int64, ::Int64, ::typeof(based_on_example)) at /home/grahams/git/TravelingWaveSimulations/src/script_helpers/based_on_example.jl:62
 [5] (::TravelingWaveSimulations.var"#kw##based_on_example")(::NamedTuple{(:data_root, :analyses, :batch, :no_save_raw, :max_sims_in_mem, :modifications, :example_name),Tuple{String,Array{Any,1},Int64,Bool,Int64,Array{Any,1},String}}, ::typeof(based_on_example)) at ./none:0
 [6] top-level scope at /home/grahams/git/TravelingWaveSimulations/scripts/based_on_example.jl:45
 [7] include at ./boot.jl:328 [inlined]
 [8] include_relative(::Module, ::String) at ./loading.jl:1105
 [9] include(::Module, ::String) at ./Base.jl:31
 [10] exec_options(::Base.JLOptions) at ./client.jl:287
 [11] _start() at ./client.jl:460
in expression starting at /home/grahams/git/TravelingWaveSimulations/scripts/based_on_example.jl:45

Note that only half the workers terminate (investigated via htop on an interactive session). I can’t figure out if all the terminated workers are on the same node, but I’m guessing so.

Based on various print statements, the error occurs during step 4 before any data has been saved.