There is a bug in this function and I can't figure out what it is

This is related to my previous question, which was lacking a lot of information and was not reproducible. In short, I believe there is a bug in Julia’s Distributed library that’s preventing me from connecting to our university cluster, which uses Slurm. I’ve tried ClusterManagers.jl as well as rolling my own manager; the error is the same.

The error is

[affans@hpc ArettoTest]$ tail -f job0000.out
julia_worker:9009#172.16.1.26
MethodError(convert, (Tuple, :all_to_all), 0x0000000000005549)CapturedException(MethodError(convert, (Tuple, :all_to_all), 0x0000000000005549), Any[(setindex!(::Array{Tuple,1}, ::Symbol, ::Int64) at array.jl:583, 1), ((::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at event.jl:73, 1)])
Process(1) - Unknown remote, closing connection.
Master process (id 1) could not connect within 60.0 seconds.
exiting.

As you can see, the worker Julia binary is launched, writes its host/port to STDOUT, and then crashes. I’ve narrowed down where this exception occurs: it happens in the message_handler_loop() function in the process_messages.jl file (lines 161-259). Direct link to source code: julia/process_messages.jl at master · JuliaLang/julia · GitHub

Lines 226 and 227 are where the exception gets printed, and they match the error message.

I don’t see what’s throwing the exception, though. There are references to array.jl and event.jl in the trace, but I doubt the exception is happening at such a low level. There are also a few functions called within message_handler_loop() whose purpose I don’t understand.

Could anyone with more expertise in TCPSocket communication help me figure out what’s happening? I feel like this is simply a bug from the 0.6 to 1.0 transition, maybe something to do with a convert function. I hate to ask, but this is slightly urgent for me: we are doing research with a deadline just 30 days out, and this is way outside my expertise.

Sometimes the exception is slightly different, but given the trace I believe the underlying problem is the same.

[affans@hpc ArettoTest]$ cat job0001.out
julia_worker:9009#172.16.1.27
TypeError(:deserialize_module, "typeassert", Module, ===)CapturedException(TypeError(:deserialize_module, "typeassert", Module, ===), Any[((::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at event.jl:73, 1)])
Process(1) - Unknown remote, closing connection.
Master process (id 1) could not connect within 60.0 seconds.
exiting.

Completely unrelated question: is there a way to visualize the call graph of functions starting from addprocs(), instead of manually going through the source code?

Thanks

Really hard to say without more information, but note that the bug isn’t in convert or arrays themselves, and it may not be in the message-handling loop itself, either: that loop runs code as it processes messages, including your code.

Just as a bit more of a breadcrumb: the issue stems from assigning a symbol into an array of tuples. This may be happening when setting up your topology (given which symbol appears in the backtrace). E.g.,

julia> A = Vector{Tuple}(undef, 1)
       A[1] = :all_to_all
ERROR: MethodError: Cannot `convert` an object of type Symbol to an object of type Tuple

Not sure where that array of tuples is coming from, or where that assignment is happening, though.

@mbauman Is there a way to track this issue down? I’ve put println() statements all over the process_messages.jl file. In particular, I put println statements in process_messages and process_tcp_streams, and they print successfully. However, I have

function message_handler_loop(r_stream::IO, w_stream::IO, incoming::Bool)
    wpid=0          # the worker r_stream is connected to.
    boundary = similar(MSG_BOUNDARY)
    println("hello from message handler")
    ...
end

and yet that line never gets printed. I am not sure how the function can be “called” without that line being printed. The function must be called, because it is catching the exception… at least I think it’s catching the exception. Maybe it has something to do with the @asyncs?
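
On the @async point: one thing worth knowing is that an exception raised inside an @async task is stored in the Task object rather than printed immediately, which can make it look like code ran without any effect. A minimal sketch of that behaviour (not this particular bug):

    # Minimal sketch: an exception inside an @async task is stored in the
    # Task and only surfaces once something waits on the task.
    t = @async begin
        sleep(0.1)
        error("boom")       # raised inside the task, not in the main flow
    end
    println("main keeps running; nothing is printed for the task yet")
    try
        wait(t)             # the stored exception surfaces here
    catch err
        println("caught: ", err)
    end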

What can I do to help, and how can I get you more information? Maybe I can get you access to our cluster so you have a running machine to test on?

edit: There are two error messages. What’s common between them is the line

::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})()

which looks like the signature of message_handler_loop()
Any idea?
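
Side note that may help with reading the trace: names like ##99#100 are compiler-generated types for anonymous functions and closures, so this is most likely the closure created by the @async block that calls message_handler_loop, rather than message_handler_loop’s own signature. A quick way to see where such names come from:

    # Sketch: anonymous functions get compiler-generated type names.
    f = () -> nothing
    @show typeof(f)   # prints a generated name like ##1#2 (0.6/1.0) or var"#1#2" (later)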

reply 2: Oh man, I am so confused. In the message_handler_loop() function, I added the following lines

            println(stderr, e, CapturedException(e, catch_backtrace()))
            println(stderr, "Process($(myid())) - Unknown remote, closing connection.")
            println(stderr, "hello from exception thrown")

It doesn’t print the last line! My theory is that the message_handler_loop() that’s passed to the worker binary is the “original” one and not my modified one. I don’t get it… if it’s printing out the top two lines, it should print the last println also…

@mbauman Ah! It’s a version problem (I think!). The cluster’s head node has 1.0.1 installed, while the compute nodes are on 0.6.

After a million println statements and @mauro3’s wonderful idea of evaling directly into the module, here is what’s going wrong: after running addprocs(ClusterManager), the function create_worker sets up the individual workers, mainly by establishing a read/write stream (TCPSocket). It then creates a message join_message = JoinPGRPMsg(w.id, all_locs, PGRP.topology, enable_threaded_blas, isclusterlazy()) and calls send_msg_now(...), which eventually calls write(io, MSG_BOUNDARY).

So far I’ve understood this as the worker writing a message to the master saying “I am ready”.
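
To make the framing idea concrete, here is a simplified sketch of a boundary-delimited write. This is not the actual Distributed implementation; MY_BOUNDARY is a made-up marker standing in for the real MSG_BOUNDARY byte sequence:

    # Simplified sketch of boundary-delimited framing, NOT the real code.
    const MY_BOUNDARY = UInt8[0xde, 0xad, 0xbe, 0xef]   # hypothetical marker

    function send_framed(io::IO, payload::Vector{UInt8})
        write(io, payload)       # the serialized message (e.g. the join message)
        write(io, MY_BOUNDARY)   # boundary lets the reader resynchronize
        flush(io)
    end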

Now, through some black magic, when the write function is called, another function, message_handler_loop, is run concurrently… I am guessing this function is “hooked” into the worker’s read/write stream, so every time a write happens, this function is executed. (Is this right? How does this happen?)

Anyway, in the message_handler_loop function, the line version = process_hdr(r_stream, incoming), which processes the “header” message, is how I found the problem. In process_hdr, the body has

    if length(version) < HDR_VERSION_LEN
        println("about to throw an error")
        error("Version read failed. Connection closed by peer.")
    end

which throws the error because of the incompatible versions installed. Now the error here makes sense, and if this is what had been printed… I would’ve figured out my problem a long time ago.
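
In other words, the handshake expects a fixed-length version header from the peer, and a short read means the peer closed the connection (or never sent a compatible header). A minimal sketch of that idea, with HDR_LEN as a hypothetical stand-in for HDR_VERSION_LEN:

    # Sketch of the handshake check, not the actual process_hdr.
    const HDR_LEN = 16   # hypothetical header length

    function read_version(io::IO)
        hdr = read(io, HDR_LEN)   # read(io, n) returns at most n bytes, fewer on EOF
        length(hdr) < HDR_LEN && error("Version read failed. Connection closed by peer.")
        return String(hdr)
    end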

However, I am guessing that with all the @asyncs and the various functions being executed at the same time, this error message somehow gets lost in the stack… the message_handler_loop part is in a try/catch, but the exception message thrown there is vastly different from what’s actually happening.

(Maybe the error message comes from the Julia 0.6 side installed on the compute nodes. Maybe it’s trying to convert a symbol into a tuple and failing, but I have no idea how that is happening.)

I will open an issue on GitHub, although I have a feeling that with 1.1 this is actually fixed by
https://github.com/JuliaLang/julia/pull/28878

Now, hopefully I am right. When the sysadmin upgrades the compute nodes tomorrow, I hope everything is fixed.

Sounds like a pretty epic piece of debugging! For what it’s worth, I think this explanation is completely plausible. Anecdotally, I’ve heard about similar woes at the intersection of Distributed and unreliable exception reporting, from @vchuravy I think.

message_handler_loop is called inside an @async, so any time another task yields control, the task running message_handler_loop may be scheduled next. The confusing thing is that there are a few functions which can cause task switching without the user generally being aware of it. For example, write(::IO, ...) may do this depending on the specific subtype of IO. In particular, I think this can happen when the specific IO type is registered with the libuv event loop: so not currently for file IO (in base/iostream.jl), but for pipes and TTYs, which are defined in base/stream.jl and which call uv_*_init, passing the julia global event loop. Generally, you should be aware that IO functions might yield, depending on the underlying device.
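
Here is a deterministic sketch of that implicit switching, using an explicit yield() as a stand-in for the yield that can happen inside write(::IO, ...) on a libuv-backed stream:

    # Sketch: the @async task is scheduled but only runs when the main task
    # yields; IO on pipes/TTYs/sockets can yield implicitly in the same way.
    order = String[]
    t = @async push!(order, "task")
    push!(order, "before yield")
    yield()                        # stand-in for an implicit yield inside IO
    push!(order, "after yield")
    wait(t)
    @show order                    # ["before yield", "task", "after yield"]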

(Note that julia doesn’t have true parallel tasks yet, so task execution is done serially by the main event loop. This doesn’t prevent task switching from being confusing, though it does rule out a certain class of truly nasty concurrency bugs.)