`MethodError` in Julia processes launched with `--worker` argument


#1

I wrote a cluster manager for our cluster, which runs Slurm. Running the code is now as simple as

julia> a = ArettoManager(124)
ArettoManager(124)

julia> addprocs(a, partition="defq", N=16)

which launches 124 Julia binaries with the --worker argument across 16 nodes. However, these worker processes exit immediately with a MethodError. Capturing STDOUT from one of them gives

[affans@hpc ArettoTest]$ cat job0000.out
julia_worker:9070#172.16.1.26
MethodError(convert, (Tuple, :all_to_all), 0x0000000000005549)CapturedException(MethodError(convert, (Tuple, :all_to_all), 0x0000000000005549), Any[(setindex!(::Array{Tuple,1}, ::Symbol, ::Int64) at array.jl:583, 1), ((::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at event.jl:73, 1)])
Process(1) - Unknown remote, closing connection.
Master process (id 1) could not connect within 60.0 seconds.
exiting.

As you can see, the worker binary correctly writes the port and host information on STDOUT.
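
For reference, the first line of each job file has the form `julia_worker:PORT#HOST`. Below is a minimal sketch of how that line can be split into its two fields, for example before a manager records them for the master to connect to. The helper name `parse_worker_line` is hypothetical and not part of Distributed; the sketch assumes Julia 0.7 or later because it returns a named tuple.

```julia
# Hypothetical helper (not part of Distributed): split a worker
# announcement line of the form "julia_worker:PORT#HOST".
function parse_worker_line(line::AbstractString)
    m = match(r"^julia_worker:(\d+)#(.*)$", strip(line))
    m === nothing && return nothing   # not an announcement line
    port = parse(Int, m.captures[1])
    host = String(m.captures[2])
    return (host = host, port = port)
end

info = parse_worker_line("julia_worker:9070#172.16.1.26")
println(info.host, " ", info.port)
```

Any line that does not match the format, such as the error lines that follow it in the job file, makes the helper return `nothing`.
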
The master Julia process itself reports the following exception:

ERROR: InterruptException:
Stacktrace:
 [1] #addprocs_locked#44(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol},NamedTuple{(:partition, :N),Tuple{String,Int64}}}, ::Function, ::ArettoManager) at ./task.jl:266
 [2] #addprocs_locked at ./none:0 [inlined]
 [3] #addprocs#43(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol},NamedTuple{(:partition, :N),Tuple{String,Int64}}}, ::Function, ::ArettoManager) at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:369
 [4] (::getfield(Distributed, Symbol("#kw##addprocs")))(::NamedTuple{(:partition, :N),Tuple{String,Int64}}, ::typeof(addprocs), ::ArettoManager) at ./none:0
 [5] top-level scope at none:0

I don’t know where the MethodError is coming from. It doesn’t seem to come from my code but from something internal, likely Distributed.jl.

Edit: Some workers print a slightly different error to STDOUT:

[affans@hpc ArettoTest]$ cat job0001.out
julia_worker:9015#172.16.1.26
TypeError(:deserialize_module, "typeassert", Module, ===)CapturedException(TypeError(:deserialize_module, "typeassert", Module, ===), Any[((::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at event.jl:73, 1)])
Process(1) - Unknown remote, closing connection.
Master process (id 1) could not connect within 60.0 seconds.
exiting.