Handling errors in threads

Ross_Boylan · February 9, 2023, 1:02am

My program keeps encountering errors in threads it has spawned; the only symptom I see is the main thread stalled while trying to assign work. The tendency of threads and tasks to swallow errors silently is apparently long-standing, and the changes made so far don't seem to cover my case. I would like to be able to uncover what is going wrong in the worker threads.

Any advice on how to handle this?

Some observations:

Calling wait on a task apparently propagates the error. But I am blocked on a Channel, which reports no error of its own; it is simply full because the workers that service it have died.
Getting the code to work without threads is somewhat delicate, given its tendency to block if things don’t happen in the right order.
Despite that, I wrote a single-threaded test that has helped find errors. But then when I try to do something a bit different, I sometimes discover there is a new error.
One thought is to execute the worker inside try/catch and then use a separate channel to send the exception, or perhaps something less context-dependent (e.g., a string with the stack trace), back to the main thread. But that seems like significant extra complexity.
Or the exception handler inside the thread could just print to stderr. I’m not sure if that’s available outside the main thread.
My interest is in discovering the details of what went wrong, not in continuing the computation. I expect the latter is generally impossible, since the errors are typically syntax errors or something similar (e.g., no method defined).
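The try/catch idea from the observations above could look roughly like the following. This is only a sketch: `process`, `guarded_worker`, and the `errors` channel are hypothetical names, the jobs are plain integers rather than the tuples used in the real program, and the job channel is closed to signal the end of work instead of sending sentinel values. It both prints the failure to stderr from the worker and sends the rendered stack trace back as a string.

```julia
# Hypothetical stand-in for the real worker body; it fails on purpose for job 3.
function process(job)
    job == 3 && error("job $job failed")
    return job^2
end

# Wrap the worker loop so that an exception is reported instead of vanishing.
function guarded_worker(command::Channel, errors::Channel)
    try
        for job in command   # iteration stops once `command` is closed and drained
            process(job)
        end
    catch err
        # Render the exception and its stack trace to a plain string.
        msg = sprint(showerror, err, catch_backtrace())
        println(stderr, "worker failed on thread ", Threads.threadid(), ":\n", msg)
        put!(errors, msg)
    end
end

command = Channel{Int}(8)
# One slot per worker, so a failing worker never blocks while reporting.
errors = Channel{String}(Threads.nthreads())
tasks = [Threads.@spawn guarded_worker(command, errors) for _ in 1:Threads.nthreads()]

for job in 1:5
    put!(command, job)
end
close(command)          # no more work
foreach(wait, tasks)    # workers catch their own errors, so wait does not throw here
close(errors)
reports = collect(errors)
```

With this arrangement `reports` holds one string per failed worker. A worker that fails stops taking jobs, so any work still queued may be left unprocessed, which matches the stated goal of diagnosing the failure rather than continuing.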

Here’s the core of the code that launches and manages the workers:

    # excerpt, not complete: worker, ml, ev, nclusters and nclustersize are defined elsewhere
    nT = Threads.nthreads()
    command = Channel(2 * nT)
    # launch workers
    tasks = [Threads.@spawn worker(command, ml, ev) for i in 1:nT]
      
    # feed them jobs
    for iCluster in 1:nclusters
        # next line is where main thread eventually blocks
        # "cluster" here refers to the data, not to computation resources
        put!(command, ((iCluster-1)*nclustersize+1, iCluster*nclustersize, iCluster))
    end
    # let each know there is no more work
    for i in 1:nT
        put!(command, (-1, -1, -1))
    end

    # wait for them to finish
    for t in tasks
        wait(t)
    end
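
A lighter-weight option, assuming Julia 1.7 or later, is to wrap each spawned task in `errormonitor`, which logs an unhandled task failure to stderr when it happens, even if the main thread never reaches `wait`. The sketch below uses a hypothetical `flaky_worker` and a single integer job in place of the real worker and job tuples.

```julia
# Hypothetical worker that fails as soon as it receives a job.
function flaky_worker(command::Channel)
    job = take!(command)
    error("cannot handle job $job")
end

command = Channel{Int}(2)

# errormonitor returns the task it was given, so it can wrap the spawn directly.
t = errormonitor(Threads.@spawn flaky_worker(command))
put!(command, 1)

# wait rethrows the failure wrapped in a TaskFailedException.
caught = try
    wait(t)
    nothing
catch err
    err
end
```

Applied to the code above, the same wrapping would go around the spawn in the `tasks` comprehension, so that each worker's failure is printed while the main thread is still feeding jobs.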