This is my first time posting here, and I am posting after having read through a number of similar posts on Julia Discourse. My problem in a nutshell is that when I run a script on PBS, where the script does some parallel processing, the script doesn’t terminate when all the processes have finished executing. I have tried using interrupt() and rmprocs(), and they do terminate all the processes except process 1, which continues to exist doing nothing until PBS force quits the script after exceeding the time limit.
Here is a sample of the code I’m running on the PBS HPC cluster:
using Distributed
addprocs(length(Sys.cpu_info()))
@everywhere begin
    # DO SOMETHING PARALLEL
end
println("Result from above computation")
rmprocs(length(Sys.cpu_info()))
exit()
And here is my PBS script:
#!/bin/bash
#PBS -N simulation
#PBS -l select=1:ncpus=64
#PBS -l walltime=1:00:00
# Setup Julia
export PATH="$PATH:/home/users/scratch/julia-1.8.2/bin"
julia main.jl
What baffles me is that I can’t seem to reproduce this on my local machine or on any other server. It only occurs when running the script on PBS. If anyone has experienced something similar and has any suggestions or alternatives, they would be much appreciated.
Hi @jmair, thanks for the reply. I tried what you recommended. PBSManager takes two arguments, not one (addprocs(PBSManager(np, qsub_flags))), and I couldn’t get it to run at all, even when following the example straight from the documentation. I also tried the other alternative mentioned in the documentation, addprocs_pbs(np::Integer), which gave me the following error:
From reading the ClusterManagers documentation I think their library is made for writing Julia scripts that work directly with the HPC cluster without the aid of an additional PBS script. Nevertheless, I still thank you for your suggestion.
Ah, it’s a shame, as the SLURM manager works a bit differently (you still have a script to allocate resources then addprocs spawns the processes). It would be nice to see if someone else here uses PBS as you would be able to easily scale across multiple nodes.
Hi @jmair, the rmprocs change did not do anything. It removed all the worker processes but process 1 still exists idly until PBS force quits the program at walltime.
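For context, the difference between the two rmprocs calls in this thread can be sketched as follows. rmprocs takes worker ids, so rmprocs(length(Sys.cpu_info())) asks for the single worker whose id equals the CPU count, whereas rmprocs(workers()) removes every worker. This is only a minimal sketch: it assumes two local workers in a fresh session rather than the 64 used on the cluster, and it does not reproduce the PBS hang itself.

```julia
using Distributed

# Assumption: a small local pool in a fresh session, standing in for the cluster run.
addprocs(2)

# Worker ids start at 2; id 1 is the master process.
worker_ids_before = workers()

# Remove every worker by id and wait for them to shut down.
rmprocs(workers())

# Only the master process should remain at this point.
remaining = nprocs()
println("workers before: ", worker_ids_before, ", processes left: ", remaining)
```

Once all workers are gone, workers() falls back to returning the master id, which matches the observation above that only process 1 is left.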
Hi @jishnub, I don’t think adding a pkill -9 julia command at the end of my PBS script would work, as PBS would still be held up by the julia main.jl command. However, I was able to force-terminate the process by adding it to the end of my Julia script as follows:
rmprocs(workers())
run(`pkill -9 julia`)
I think this effectively resolves my issue. Many thanks to both @jmair and @jishnub for suggesting alternatives to try.
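One caveat with pkill -9 julia is that it signals every process matching the name julia that the user is allowed to signal on that node, which could include other running jobs. A narrower variant, offered here as an untested assumption rather than something verified on PBS, is to have the script send the signal only to its own process id. The helper name self_kill_cmd is hypothetical.

```julia
# Hypothetical helper: build a kill command aimed only at this Julia process.
self_kill_cmd() = `kill -9 $(getpid())`

# Intended use at the very end of the script (requires `using Distributed`).
# Left commented out so that running this sketch does not terminate the session.
# Output is flushed first because SIGKILL skips normal cleanup.
# rmprocs(workers())
# flush(stdout); flush(stderr)
# run(self_kill_cmd())

# Show the command that would be run.
println(self_kill_cmd())
```

Since getpid() returns the id of the calling process, other Julia sessions on the same node would be left alone.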
Sounds like a bug, though. The program shouldn’t require a force-terminate. Could you file an issue on the Julia repo? It would be good to have this looked into.