Reclaiming a worker on a long-running process

Is the following possible:

  • Set a long-running Julia process going on a server.
  • Log off the server, move on with life.
  • SSH back in at a later date.
  • Fire up a Julia REPL and add the long-running process as a worker or similar, e.g. to extract ad hoc information from data structures in memory without restarting.

I’m not sure how to attack this use case. Is it possible? If so, which docs/source should I read?

I’m not sure about an entirely Julia based solution, but when I need to run long simulations on a cluster, I use the screen command. Maybe that would work for you as well?

SSH + screen works for this. On a cluster you could use the job scheduler instead, but scheduler jobs usually can’t be made interactive. Personally, I just use VNC on my own lab computers, since when you log back in you get the same screen. This solves “getting back to the same process”.

However, for intermediate modifications and saving… what types of problems do you plan on solving? If it’s differential equations, the obstacle to restarting right now is that JLD has trouble saving functions. Otherwise, DifferentialEquations.jl’s integrator interface with a callback for intermediate saves to JLD would handle this just fine.

If it’s for optimization, you’d need to find a way to save some of the intermediate data. I believe the iterator interfaces which are being worked on in Optim and JuliaML have a way of letting you save and modify state.

But as for “extract ad hoc information from data structures in memory without restarting”, that is highly problem-dependent, and we’d have to know what you’re doing.

The screen solution looks useful and wasn’t something I was aware of - might give that a go for some of my usage.

What I was imagining was more like when you call addprocs(n). In that case some Julia processes get created and we take ownership of them. Is there some way to not kill them off when exiting the REPL, and instead be able to reclaim them later?
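
To make the ownership model concrete, here is a minimal sketch, assuming a recent Julia where addprocs lives in the Distributed standard library. The CHAIN global is a hypothetical stand-in for whatever data structure the workers hold in memory. It shows the part that already works (inspecting worker memory from the owning session), not the reclaiming part I’m asking about.

```julia
using Distributed

# Start two local workers; the current session owns them.
pids = addprocs(2)

# Hypothetical state that lives only in each worker's memory.
@everywhere pids begin
    CHAIN = Float64[]
end

# Do some "work" on each worker by evaluating an expression in its Main module.
for p in pids
    remotecall_wait(Core.eval, p, Main, :(append!(CHAIN, [1.0, 2.0, 3.0])))
end

# Ad hoc inspection from the owning session: ask each worker about its state.
lengths = [remotecall_fetch(Core.eval, p, Main, :(length(CHAIN))) for p in pids]
@show lengths

# Removing the workers shuts them down; they also go away when this session exits.
rmprocs(pids)
```

The open question is what to do instead of that last step, so that the workers outlive the session.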

I’m largely doing MCMC, and save my results to disk at each iteration anyhow for memory reasons. But I’m aiming for a more generic solution.
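
For what it’s worth, the save-at-each-iteration pattern looks roughly like the sketch below. It assumes the Serialization standard library rather than JLD, and the sampler, the checkpoint path and the save_checkpoint helper are all hypothetical stand-ins, not my real code. A second Julia session can read the file at any time, which covers MCMC but not the generic case.

```julia
using Serialization, Random

# Hypothetical checkpoint location; any path on the server's disk would do.
const CHECKPOINT = joinpath(mktempdir(), "chain.jls")

# Serialize to a temporary file, then rename it over the target, so a reader
# in another session does not see a half-written checkpoint.
function save_checkpoint(path, state)
    tmp = path * ".tmp"
    serialize(tmp, state)
    mv(tmp, path; force=true)
end

# Toy random-walk Metropolis sampler targeting a standard normal.
function run_chain(path; iters=200, rng=MersenneTwister(1))
    x = 0.0
    samples = Float64[]
    for i in 1:iters
        proposal = x + randn(rng)
        # Accept with probability min(1, p(proposal) / p(x)), p(x) ∝ exp(-x^2 / 2).
        if log(rand(rng)) < (x^2 - proposal^2) / 2
            x = proposal
        end
        push!(samples, x)
        # Rewrites the whole chain every iteration: simple, but wasteful for long runs.
        save_checkpoint(path, (iter=i, samples=samples))
    end
    return samples
end

run_chain(CHECKPOINT)

# From any other Julia session (e.g. a fresh REPL after SSHing back in):
state = deserialize(CHECKPOINT)
println("iteration ", state.iter, ", mean so far ", sum(state.samples) / length(state.samples))
```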