Is there a way to abort a task that does not yield?

I searched for ways to run a task with a timeout, but all the answers I found used async tasks, which won’t work because the whole point is that the code doesn’t yield.

I understand that killing a thread is generally bad, but I ran into a particular case that seems justified: a large calculation in code that never yields and that you can’t reasonably modify to yield (e.g. a library). You need to abort it if it takes too long (e.g. so it doesn’t hang your REPL such that even Ctrl+C won’t stop it, which happened to me today).

I found one threaded example, and that appears to do what I need. But notice that the example sleeps, which means it yields nicely. What about something that doesn’t yield? Apparently it crashes the REPL:

Threads.nthreads() # output: 24

function run_timeout(f, seconds)
    t = Threads.@spawn begin
        try
            f()
        catch
            # println("Thread interrupted")
        end
    end
    Timer(seconds) do _ # also tried spawning this in separate thread
    # Threads.@spawn Timer(seconds) do _
        try
            if !istaskdone(t)
                println("Task did not complete in time. Aborting.")
                # Base.throwto(t, InterruptException())
                schedule(t, InterruptException(), error=true)
            else
                println("Task completed within seconds.")
            end
        catch
            # println("Error around timer")
        end
    end
end

# This works fine and does not print 'done' which is expected.
run_timeout(() -> ( sleep(10) ; println("done") ), 1.0)

spin() = reduce((x, a) -> log(x+a)%12341234, 1:100000000; init=1)
@time spin(); # output: 13.953269 seconds

# Here's the problem:
run_timeout(spin, 0.01);

#= output:
Task did not complete in time. Aborting. # displayed after 1 second
# REPL hangs for the same time spin() takes to run then it crashes with:
fatal: error thrown and no exception handler available.
20.7233
try_yieldto at .\task.jl:931
wait at .\task.jl:995
task_done_hook at .\task.jl:675
jfptr_task_done_hook_79050.1 at C:\app\dev\Julia\lib\julia\sys.dll (unknown line)
jl_apply at C:/workdir/src\julia.h:1982 [inlined]
jl_finish_task at C:/workdir/src\task.c:320
start_task at C:/workdir/src\task.c:1249
=#
# REPL has crashed

I tried commenting out the println calls in case they were related, but they aren’t.
Particularly surprising is that the REPL hung even when I spawned the Timer call in a separate thread.

So, this tells me that the solution in the link above does not kill a task/thread. It schedules a task on that thread that won’t run until it yields, so it will do nothing for unyielding tasks. Crashing is also a problem.

Everything I’m reading about this type of problem seems to be about how to do it nicely, resource release concerns and such. But I’m talking about the case when a thread is bad, a misbehaving comprehensive consumer of your CPU, and I want to kill it with fire. I can understand there might be an argument that at that point maybe it would be better to kill the whole process and start from scratch. However, particularly in the Julia REPL world where it can take 5 minutes or more to get back to the same state due to compilation, data loading, etc., it would be nice to not have to do that.

There are a few ways to do this if you can run your task on a worker process.

Timeout.jl/src/Timeout.jl at master · ararslan/Timeout.jl · GitHub uses rc = ccall(:uv_kill, Cint, (Cint, Cint), ospid, SIGTERM), which is definitely a “kill it with fire” method.

There is also GitHub - JuliaPluto/Malt.jl: Simple multiprocessing for Julia, which is used by Pluto and seems to have a nice API for stopping or interrupting workers: Malt.jl · Malt
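To make the process-based idea concrete, here is a minimal sketch that uses only Base functions (run, timedwait, process_exited, kill). The helper name run_cmd_timeout and the choice of SIGKILL are assumptions for illustration; this is not how Timeout.jl or Malt.jl are implemented, and it runs a whole command rather than a Julia closure.

```julia
# Hypothetical helper (not from Timeout.jl or Malt.jl): run `cmd` as a child
# process and kill it if it has not exited after `seconds`.
function run_cmd_timeout(cmd::Cmd, seconds::Real)
    # ignorestatus: a nonzero exit code or a kill should not raise an error here.
    proc = run(ignorestatus(cmd); wait=false)
    # timedwait polls the callback and returns :ok or :timed_out.
    if timedwait(() -> process_exited(proc), Float64(seconds)) == :timed_out
        kill(proc, Base.SIGKILL)  # the OS stops the process; no yield is needed
        wait(proc)                # wait until the child has actually exited
        return :killed
    end
    return success(proc) ? :ok : :failed
end

# A child Julia process that spins forever without yielding.
spin_cmd = `$(Base.julia_cmd()) -e 'while true end'`
run_cmd_timeout(spin_cmd, 1.0)
```

Because the spinning code lives in its own process, the parent REPL stays responsive and the call above should come back with :killed after about a second. The cost is that the child starts from scratch, so any data or compiled state it needs has to be loaded or sent over again.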

I’m not sure how to do this with threads but maybe the interactive threadpool could help?


Unfortunately there is no general safe way of aborting an uncooperative thread.

The issue is that a thread might be in some critical section (holding a lock), and nuking the thread will leave this lock taken. Or some other data structure might be in a temporarily inconsistent state, possibly inside a C library that is called from your Julia code.

It is helpful to read pthread_cancel(3) - Linux manual page – for good reason, the default is deferred.
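For comparison, deferred cancellation in Julia terms amounts to the running code checking a flag at points where stopping is known to be safe. The sketch below assumes you can modify the loop, which is exactly what is not possible in the original question; the function name cancellable_sum and the check interval are made up for illustration.

```julia
using Base.Threads: Atomic, @spawn

# Hypothetical worker: sums log(i) for i in 1:n, but gives up (returns
# `nothing`) if `cancel` is set when it reaches a cancellation point.
function cancellable_sum(n, cancel::Atomic{Bool})
    acc = 0.0
    for i in 1:n
        # Explicit cancellation point, reached every 1_000_000 iterations.
        if i % 1_000_000 == 0 && cancel[]
            return nothing
        end
        acc += log(i)
    end
    return acc
end

# Usage: needs Julia started with more than one thread (e.g. `julia -t 2`).
# With a single thread the main task cannot run again until the loop finishes.
cancel = Atomic{Bool}(false)
t = @spawn cancellable_sum(200_000_000, cancel)
sleep(0.1)
cancel[] = true   # request cancellation; honored at the next check
result = fetch(t) # `nothing` if the request arrived before the loop finished
```

The worker only stops at points it chose itself, so locks and data structures are never abandoned halfway. Code that contains no such checks cannot be stopped this way.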

Safe cancellation requires some kind of isolation and tracking of state that the thread owns. Such an isolated thing is called “process”, and can be safely nuked from orbit, a la kill -9; and if that is not enough (spawned child processes…) then cgroup will do the job.

So if your runaway thread/task doesn’t yield and doesn’t hit safepoints, you might really want an unsafe YOLO way.

Afaiu there is no library for that.

If I were to do that, I would try installing a signal handler, and then sending a signal to the process. That should preempt all threads. The signal handler then needs to check whether it runs on the offending thread, and if so walk up the stack until it finds the entry point, and return to there. Note that all finally blocks in Julia will be skipped, as will all external exception handlers (if you’re currently in some C++ code, or some Java code called from Julia), so this is really “YOLO”.

(nuking the thread is no good – then the scheduler will soon blow up. You need some minimal cleanup)

From a compiler/language viewpoint, this is absolutely undefined behavior, and exit(0) is a very benign and simple implementation of UB, hence everyone will tell you to just kill the process.

It would be a fun hack, though – please post if you decide to go for that. I think there should be recipes flying around in the security space – that should be a common enough setting there (have taken over one thread via exploit, need to kill/modify some other uncooperative threads, with blowup due to inconsistencies rare and late enough to have time to fully pwn the system). Will be very OS dependent.


Thank you both for the informative replies. It’s interesting to note that from that pthread_cancel link, the required cancellation points appear to all be IO related… so that wouldn’t help if a thread is just number crunching. It appears the answer is that if one really needs to be able to kill it, it would have to be in its own process.

That said, though, should the code above be crashing the REPL? If one is willing to just have the rogue thread play out but stop whenever it does yield… isn’t that code the way to accomplish that? Is that crash a bug? Or am I just doing it wrong?