Handling errors in threads

My program keeps hitting errors in threads it has spawned; the only symptom I see is the main thread stalling while it tries to assign work. The tendency of threads and tasks to swallow errors silently is apparently a long-standing issue, and the changes made so far don’t cover my case. I would like to be able to uncover what is going wrong in the worker threads.

Any advice on how to handle this?

Some observations:

  1. Calling wait on a task apparently propagates the error. But I am blocked on a Channel, which reports no error of its own; it is simply full because the workers that service it have died.
  2. Getting the code to work without threads is somewhat delicate, given its tendency to block if things don’t happen in the right order.
  3. Despite that, I wrote a single-threaded test that has helped find errors. But then when I try to do something a bit different, I sometimes discover there is a new error.
  4. One thought is to execute the worker using try/catch and then using a separate channel to send the exception, or perhaps something less context dependent (e.g., a string with the stack trace) back to the main thread. But that seems like significant extra complexity.
  5. Or the exception handler inside the thread could just print to stderr. I’m not sure if that’s available outside the main thread.
  6. My interest is in discovering the details of what went wrong, not in continuing the computation. I expect the latter is generally impossible, since the errors are typically syntax errors or something similar (e.g., no method defined).
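
Observations 4 and 5 could be sketched roughly as follows. This is only an illustration under assumptions: guarded, the errors channel, and undefined_function are hypothetical names, and the worker body is a stand-in that fails on purpose. The wrapper renders the exception and its backtrace to a String, prints it to stderr (which can be written to from any task), and also sends it back over a separate channel.

```julia
# Hypothetical wrapper: `work` stands in for the real worker body.
function guarded(work, errors::Channel)
    try
        work()
    catch err
        # Render the error plus backtrace to a String, so nothing
        # task-specific has to cross the channel.
        msg = sprint(showerror, err, catch_backtrace())
        println(stderr, "worker failed:\n", msg)
        put!(errors, msg)
    end
end

# Unbounded channel, so a failing worker never blocks while reporting.
errors = Channel{String}(Inf)

# Stand-in worker that calls a function which does not exist.
t = Threads.@spawn guarded(() -> undefined_function(1), errors)
wait(t)   # does not throw: the wrapper already caught the error

report = isready(errors) ? take!(errors) : ""
```

The extra complexity is limited to the wrapper and one channel; the main thread can check isready(errors) whenever it suspects the workers have stopped.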

Here’s the core of the code that launches and manages the workers:

    # not complete
    nT = Threads.nthreads()
    command = Channel(2*nT)
    # launch workers
    tasks = [Threads.@spawn worker(command, ml, ev) for i in 1:nT]
      
    # feed them jobs
    for iCluster in 1:nclusters
        # next line is where main thread eventually blocks
        # "cluster" here refers to the data, not to computation resources
        put!(command, ((iCluster-1)*nclustersize+1, iCluster*nclustersize, iCluster))
    end
    # let each know there is no more work
    for i in 1:nT
        put!(command, (-1, -1, -1))
    end

    # wait for them to finish
    for t in tasks
        wait(t)
    end
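
One way to stop the main thread from blocking forever in put! might be to have a failing worker close the channel with its exception, so that a pending or later put! throws that exception instead of waiting. The sketch below is not the code above: worker here is a hypothetical stand-in that fails on every job from 3 onward, and supervised is an illustrative name. It relies on close(channel, exception), which is part of Base. On Julia 1.7 and later, wrapping each task in errormonitor is another option for getting the failure logged to stderr.

```julia
# Hypothetical stand-in for the real worker: every job from 3 onward fails.
function worker(command)
    for job in command   # stops when the channel is closed normally and drained
        job >= 3 && error("bad job $job")
    end
end

# Run the worker; on failure, close the channel with the exception so
# that anyone blocked on put! or take! is woken up with that error.
function supervised(command)
    try
        worker(command)
    catch err
        close(command, err)
        rethrow()
    end
end

command = Channel{Int}(2)
tasks = [Threads.@spawn supervised(command) for _ in 1:2]

# Feed jobs; the worker's exception surfaces here instead of a hang.
failure = try
    for job in 1:100
        put!(command, job)
    end
    close(command)
    nothing
catch err
    err
end

failure === nothing || showerror(stderr, failure)
```

With this arrangement the feeding loop itself reports what went wrong, which matches the goal of diagnosing the failure rather than continuing the computation.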

Hi, I was facing a similar issue: I was using a channel to store workspaces for a calculation I wanted to multi-thread, and I similarly didn’t find many resources on what to do when the program hangs. Luckily, my code has a separate serial mode, which let me confirm that errors were occurring. My multi-threading function, which allocates a solver from a channel to each task and expects the task to return the solver to the channel so that another task can use it, started like this:

function do_threaded_solve!(integrand, channel, f, y, x, p)
    @sync for (iy, xi) in zip(eachindex(y), x)
        solver = take!(channel)
        Threads.@spawn begin
            y[iy] = integrand(solver, f, xi, p) # errors in this line cause the program to hang
            put!(channel, solver)
        end
    end
end

However, I also wanted errors that occur during multi-threading to be thrown to the main process. I realized that tasks must always return their solver to the channel, which a try/finally block guarantees, so that the main process never waits on an empty channel. If I also rethrow the error in a catch block of the task, then the @sync block, which waits on the tasks, rethrows the errors so I can see them. Now my code is

function do_threaded_solve!(integrand, channel, f, y, x, p)
    @sync for (iy, xi) in zip(eachindex(y), x)
        solver = take!(channel)
        Threads.@spawn try
            y[iy] = integrand(solver, f, xi, p)
        catch e
            rethrow(e)
        finally
            put!(channel, solver)
        end
    end
end

and when the integrand call throws, I get the following error:

ERROR: LoadError: TaskFailedException

    nested task error [...]

... and 14 more exceptions
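
To see the underlying errors rather than the wrappers, the exception thrown by @sync can be unpacked. The following is a rough sketch, not part of the code above: root_causes is a hypothetical helper, the spawned tasks are stand-ins for the integrand calls, and reading task.result relies on a field of Task rather than a documented accessor, so treat it as an assumption that may need adjusting for your Julia version.

```julia
# Collect the underlying exceptions from whatever @sync threw.
function root_causes(err, out = Any[])
    if err isa CompositeException
        # @sync bundles all task failures into one CompositeException.
        foreach(e -> root_causes(e, out), err.exceptions)
    elseif err isa TaskFailedException
        # Assumption: a failed task stores its exception in `result`.
        root_causes(err.task.result, out)
    else
        push!(out, err)
    end
    return out
end

# Stand-in for the threaded loop: the task for i == 2 fails.
caught = try
    @sync for i in 1:3
        Threads.@spawn i == 2 ? error("integrand failed for i = $i") : i
    end
    nothing
catch err
    err
end

causes = root_causes(caught)
for cause in causes
    showerror(stderr, cause)
    println(stderr)
end
```

This prints one line per failed task, which may be easier to read than the nested output when many tasks fail for the same reason.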

@Ross_Boylan in your case did you find a solution?