This is related to my previous question, which lacked a lot of information and was not reproducible. In short, I believe there is a bug in Julia’s Distributed library that’s preventing me from connecting to our university cluster, which uses Slurm. I’ve tried ClusterManagers.jl as well as rolling my own cluster manager. Both give the same error.
The error is
[affans@hpc ArettoTest]$ tail -f job0000.out
julia_worker:9009#172.16.1.26
MethodError(convert, (Tuple, :all_to_all), 0x0000000000005549)CapturedException(MethodError(convert, (Tuple, :all_to_all), 0x0000000000005549), Any[(setindex!(::Array{Tuple,1}, ::Symbol, ::Int64) at array.jl:583, 1), ((::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at event.jl:73, 1)])
Process(1) - Unknown remote, closing connection.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
As you can see, the worker Julia binary is launched, writes its host/port to STDOUT, and then crashes. I’ve narrowed down where this exception occurs: it happens in the message_handler_loop() function in the process_messages.jl file (lines 161 - 259). Direct link to source code: julia/process_messages.jl at master · JuliaLang/julia · GitHub
Lines 226 and 227 are where the exception gets printed, and the output matches the error message.
I don’t see what’s throwing the exception though. There are references to array.jl and event.jl in the trace, but I doubt the exception originates at such a low level. There are also a few functions called within message_handler_loop() whose purpose I don’t understand.
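For reference, the first frame of the trace (setindex! on an Array{Tuple,1} receiving a Symbol) can be reproduced in isolation. This is only a minimal sketch of that single frame, assuming Julia 1.0 syntax; it is not the Distributed code path, and the name `slots` is made up for illustration.

```julia
# Minimal sketch of the first trace frame only; `slots` is a made-up name.
slots = Vector{Tuple}(undef, 1)   # an Array{Tuple,1}, as in the trace

err = try
    slots[1] = :all_to_all        # setindex! calls convert(Tuple, :all_to_all)
    nothing
catch e
    e                             # keep the exception so it can be inspected
end

@show typeof(err)
@show err.args
```

So the MethodError itself seems to come from a Symbol being stored where a Tuple is expected, but this does not show which part of the connection setup does that.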
Could anyone with more expertise in TCPSocket communication help me figure out what’s happening? I feel like this is simply a bug from the 0.6 to 1.0 transition, maybe something to do with a convert function. I hate to ask, but this is slightly urgent for me since we are doing research with a deadline of just 30 days, and this is way out of my expertise.
Sometimes the exception is slightly different, but I believe the underlying problem is the same:
[affans@hpc ArettoTest]$ cat job0001.out
julia_worker:9009#172.16.1.27
TypeError(:deserialize_module, "typeassert", Module, ===)CapturedException(TypeError(:deserialize_module, "typeassert", Module, ===), Any[((::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at event.jl:73, 1)])
Process(1) - Unknown remote, closing connection.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Completely unrelated question: Is there a way to visualize the call graph of functions called starting from addprocs() instead of manually going through the source code?
Thanks