Save a local variable when program is killed

I have a Julia program that performs a long operation on a variable. When the program is killed by a job management system such as SLURM, I would like it to save the current local state of that variable to disk. Here is a minimal (and somewhat trivial) demonstration of what I want to do:

function main(x::Int, step::Int)
    @assert step > 0
    for i in 1:step
        x += 1
        sleep(1)
    end
    return x
end

function cleanup()
    println(x)
    return nothing
end
atexit(cleanup)

x = 0
println("PID = ", getpid())
for step in [10, 12, 14, 16]
    global x = main(x, step)
    println("Current x = ", x)
end

In this example, the function main adds step to x. I would like the program to save the local value of x inside main when it is terminated by a SIGTERM. If I register atexit outside main as shown in the example, I get the following error when I kill the program from the terminal:

[19672] signal 15: Terminated: 15
in expression starting at test_atexit.jl:18
kevent at /usr/lib/system/libsystem_kernel.dylib (unknown line)
unknown function (ip: 0x0)
Allocations: 10826481 (Pool: 10826120; Big: 361); GC: 9
22schedule: Task not runnable
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] schedule(t::Task, arg::Any; error::Bool)
    @ Base ./task.jl:884
  [3] schedule
    @ ./task.jl:876 [inlined]
  [4] uv_writecb_task(req::Ptr{Nothing}, status::Int32)
    @ Base ./stream.jl:1200
  [5] poptask(W::Base.IntrusiveLinkedListSynchronized{Task})
    @ Base ./task.jl:1012
  [6] wait()
    @ Base ./task.jl:1021
  [7] uv_write(s::Base.TTY, p::Ptr{UInt8}, n::UInt64)
    @ Base ./stream.jl:1081
  [8] unsafe_write(s::Base.TTY, p::Ptr{UInt8}, n::UInt64)
    @ Base ./stream.jl:1154
  [9] write
    @ ./strings/io.jl:248 [inlined]
 [10] show
    @ ./show.jl:1247 [inlined]
 [11] print(io::Base.TTY, x::Int64)
    @ Base ./strings/io.jl:35
 [12] print(::Base.TTY, ::Int64, ::String)
    @ Base ./strings/io.jl:46
 [13] println(io::Base.TTY, xs::Int64)
    @ Base ./strings/io.jl:75
 [14] println(xs::Int64)
    @ Base ./coreio.jl:4
 [15] cleanup()
    @ Main ~/test_atexit.jl:11
 [16] _atexit(exitcode::Int32)
    @ Base ./initdefs.jl:459

I think this is because cleanup does not have access to the local x in main. However, if I put it inside main and register it there:

function main(x::Int, step::Int)
    function cleanup()
        println(x)
        return nothing
    end
    atexit(cleanup)
    
    @assert step > 0
    for i in 1:step
        x += 1
        sleep(1)
    end
    return x
end

the for-loop in the Main module will register cleanup multiple times, and give the following output when not killed, which is not what I want:

PID = 19709
Current x = 10
Current x = 22
Current x = 36
Current x = 52
52
36
22
10

What I want is for the program to print the current local value of x inside the main function whenever it is killed. How can I achieve this? Thank you!

Perhaps I’m misunderstanding what you precisely want, but can’t you just make main use the global variable x?

function main(step::Int)
    @assert step > 0
    global x
    for i in 1:step
        x += 1
        sleep(1)
    end
end

function cleanup()
    println(x)
    return nothing
end
atexit(cleanup)

x = 0
println("PID = ", getpid())
for step in [10, 12, 14, 16]
    main(step)
    println("Current x = ", x)
end

(Or with const x = Ref(0) and x[] everywhere.)

PID = 29128
Current x = 10
14   # (from stopping the process 3-4 s after the previous print)
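
The `const x = Ref(0)` variant mentioned above might look like the following sketch. The `delay` keyword is an addition for illustration only (it is not in the original code) so the loop can be tried quickly; the rest mirrors the global-variable version.

```julia
# The binding `x` is constant (and therefore type-stable), while the value
# stored inside the Ref can still be mutated through x[].
const x = Ref(0)

# `delay` is a hypothetical keyword added for this sketch; the default of
# one second matches the original example.
function main(step::Int; delay::Real=1)
    @assert step > 0
    for i in 1:step
        x[] += 1
        sleep(delay)
    end
    return nothing
end

# Registered once at top level; reads whatever x[] holds when Julia exits.
cleanup() = println(x[])
atexit(cleanup)
```

The driver loop stays the same as before, except that it prints `x[]` instead of `x`.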

I can indeed rewrite my actual program this way, but is it really this cumbersome to save the state of a local variable?

Maybe you can just use catch?

function main(x::Int, step::Int)
    try    
        @assert step > 0
        for i in 1:step
            x += 1
            sleep(1)
        end
        return x
    catch
        # cleanup 
        println(x)
        rethrow()
    end
end
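
One caveat with the try/catch approach: a SIGTERM does not raise a Julia exception, so the catch block only runs for errors thrown inside Julia. A SIGINT can be caught as an InterruptException, but in a script only after calling `Base.exit_on_sigint(false)`. The sketch below assumes the job can be stopped with SIGINT rather than SIGTERM; the `delay` keyword is added here purely for illustration.

```julia
# Assumption: the process is stopped with SIGINT, not SIGTERM. In a script,
# SIGINT only becomes a catchable InterruptException after this call.
Base.exit_on_sigint(false)

function main(x::Int, step::Int; delay::Real=1)
    @assert step > 0
    try
        for i in 1:step
            x += 1
            sleep(delay)
        end
    catch err
        # Report the state only for an interrupt; rethrow everything.
        err isa InterruptException && println("interrupted at x = ", x)
        rethrow()
    end
    return x
end
```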

I’d like to add that none of these methods will ever be 100% fail-proof. If your process is killed due to OOM or a similarly drastic event, then cleanup won’t occur. If this output is for determining how far your computation progressed, then I propose logging values at sensible intervals instead of relying on some “cleanup” function.
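
As a rough sketch of the interval-logging idea, the loop below writes x to a file every few iterations. The helper names (`save_checkpoint`, `load_checkpoint`), the file name, and the `every` and `delay` keywords are all hypothetical choices for this example, not part of the original program.

```julia
# Hypothetical helper: write to a temporary file first, then rename it, so
# a kill in the middle of a write does not leave a half-written checkpoint.
function save_checkpoint(path::AbstractString, x::Int)
    tmp = path * ".tmp"
    write(tmp, string(x))
    mv(tmp, path; force=true)
    return nothing
end

# Returns the saved value, or `default` if no checkpoint exists yet.
load_checkpoint(path::AbstractString, default::Int=0) =
    isfile(path) ? parse(Int, read(path, String)) : default

function main(x::Int, step::Int; path="checkpoint.txt", every::Int=5, delay::Real=1)
    @assert step > 0
    for i in 1:step
        x += 1
        sleep(delay)
        # Checkpoint every `every` iterations and on the final one.
        if i % every == 0 || i == step
            save_checkpoint(path, x)
        end
    end
    return x
end
```

On restart, `load_checkpoint("checkpoint.txt")` would give the last saved value; at most `every - 1` iterations of progress are lost, no matter how the process died.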


For simple applications, a try-catch block might suffice, but it’s not ideal for safely controlling shutdown in more complex applications. Refer to these notes for more information.

Visor.jl could be useful for your use case.

For long-running apps like your example, you have two shutdown options: a controlled method using the isshutdown checkpoint, or a forced interrupt via an exception.

Controlled:

using Visor

function mytask(pd)
    x = 0
    println("starting mytask ...")
    while true
        x += 1

        # do some work ...
        sleep(0.1)

        if isshutdown(pd)
            println("shutdown: saving local state x=$x")
            break
        end
    end
end

supervise(process(mytask))

Forced:

using Visor

function mytask(pd)
    x = 0
    println("starting mytask ...")
    try
        while true
            x += 1
            # do some work ...
            sleep(0.1)
        end
    catch
    finally
        println("shutdown: saving local state x=$x")
    end
end

supervise(process(mytask, force_interrupt_after=0))

Both approaches allow you to manage the task’s local state when shutting down.

Yes. A local variable by definition is only directly accessible in a particular scope; if such a scope is a method (1st main), then you are intentionally making it unavailable to independent methods (1st cleanup).

Given the inability to register only the last cleanup closure before SIGTERM, the following paragraph won’t help your use case; it’s just a comment on your attempt to use closures to carry the local variable’s state out of its scope. As you experienced, methods that interact with the variable itself must still be defined inside that scope (2nd cleanup inside 2nd main). Given the current implementation of closures as callable objects, you can interact with the state of captured variables from fully independent methods, but that’s a trick, not a stable language feature. If you don’t want your methods stuck as closures, it’s better to forget about capturing local variables and use callable objects directly.
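
To illustrate the callable-object suggestion, here is a minimal sketch. The names `Counter` and `report` and the `delay` keyword are made up for this example. It only shows how the state can be shared between independent methods; it does not change what is safe to do once a signal has arrived.

```julia
# State lives in a mutable struct rather than in a captured local variable.
mutable struct Counter
    x::Int
end

# Making the struct callable gives it the role the closure played before.
function (c::Counter)(step::Int; delay::Real=1)
    @assert step > 0
    for i in 1:step
        c.x += 1
        sleep(delay)
    end
    return c.x
end

# An ordinary method defined outside the loop's scope can read the state.
report(c::Counter) = println("x = ", c.x)

counter = Counter(0)
atexit(() -> report(counter))  # registered exactly once
```

Calling `counter(10)`, then `counter(12)`, and so on replaces the repeated calls to main, and the single atexit hook always sees the latest value of `counter.x`.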

Base Julia doesn’t have sophisticated signal handling because which actions are safe in a signal handler, and when, is tricky even in ideal cases where your program isn’t obstructed by serious issues. This is especially true for languages with a significant runtime doing things outside your control; this short subthread shows a couple of examples of the difficulties and how a language may work around them. If possible, it is more reliable not to depend on the timing of a particular signal from a particular source; you may want some things to progress even if something doesn’t go according to plan.