Distributed.jl Processes not Shutting Down until Walltime on PBS

Hi All,

This is my first time posting here, and I am posting after having read through a number of similar posts on Julia Discourse. My problem in a nutshell is that when I run a script on PBS, where the script does some parallel processing, the script doesn’t terminate when all the processes have finished executing. I have tried using interrupt() and rmprocs(), and they do terminate all the processes except process 1, which continues to exist doing nothing until PBS force quits the script after exceeding the time limit.

Here is a sample of the code I’m running on the PBS HPC cluster:

using Distributed
addprocs(length(Sys.cpu_info()))

@everywhere begin
    # DO SOMETHING PARALLEL
end

# Placeholder for printing the result of the computation above
println("Result from above computation")

rmprocs(length(Sys.cpu_info()))
exit()

And here is my PBS script:

#!/bin/bash
#PBS -N simulation
#PBS -l select=1:ncpus=64
#PBS -l walltime=1:00:00
# Setup Julia
export PATH="$PATH:/home/users/scratch/julia-1.8.2/bin"
julia main.jl

What baffles me is that I can’t seem to reproduce this on my local machine or on any other server. It only occurs when running the script on PBS. If anyone has experienced something similar and has any suggestions or alternatives, they would be much appreciated.

I am not sure this will help, but is there a reason you aren’t using ClusterManagers.jl to add the other processes (ClusterManagers · Julia Packages)?

using Distributed
using ClusterManagers
nw=length(Sys.cpu_info())
addprocs(PBSManager(nw))

I don’t know if this will be any different (I have never used PBS), but it could be worth trying.

Edit: rmprocs expects worker IDs rather than a count, so the last call in your script should be:

rmprocs(workers())
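
To make the difference concrete, here is a minimal sketch run locally rather than on PBS. The worker count of 2 is an arbitrary choice for illustration and is not taken from the original script.

```julia
using Distributed

# Start a small, fixed number of local workers (2 is an arbitrary illustrative count).
addprocs(2)

# workers() returns the IDs of the worker processes; the master (ID 1) is not
# included as long as at least one worker exists.
pids = workers()

# rmprocs takes process IDs. A call such as rmprocs(2) would mean
# "remove the process whose ID is 2", not "remove two processes".
rmprocs(pids)

# Once every worker has been removed, only the master process remains.
remaining = nprocs()
```

After this, workers() falls back to returning [1], which is why a later rmprocs(workers()) with no workers left prints a "process 1 not removed" warning.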

Hi @jmair, thanks for the reply. I did try what you recommended. PBSManager takes two arguments, not one (addprocs(PBSManager(np, qsub_flags))), and I couldn’t get it to run at all, even when following the example straight from the documentation. I also tried the other alternative mentioned in the documentation, addprocs_pbs(np::Integer), which gave me the following error:

qsub: invalid option -- 'w'
qsub: invalid option -- 'd'
usage: qsub [-a date_time] [-A account_string] [-c interval]
        [-C directive_prefix] [-e path] [-f ] [-h ] [-I [-X]] [-j oe|eo] [-J X-Y[:Z]]
        [-k o|e|oe] [-l resource_list] [-m mail_options] [-M user_list]
        [-N jobname] [-o path] [-p priority] [-P project] [-q queue] [-r y|n]
        [-S path] [-u user_list] [-W otherattributes=value...]
        [-v variable_list] [-V ] [-z] [script | -- command [arg1 ...]]
       qsub --version
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/Distribute$

From reading the ClusterManagers documentation I think their library is made for writing Julia scripts that work directly with the HPC cluster without the aid of an additional PBS script. Nevertheless, I still thank you for your suggestion.

Ah, that’s a shame. The SLURM manager works a bit differently (you still have a script to allocate resources, and then addprocs spawns the processes). It would be nice to hear from someone else here who uses PBS, as you would then be able to scale across multiple nodes easily.

Did the rmprocs change do anything?

Hi @jmair, the rmprocs change did not make a difference. It removed all the worker processes, but process 1 still sits idle until PBS force quits the program at walltime.

This is quite strange. Could you try adding a pkill -9 julia command at the end of your PBS script to see if this force terminates the process?

Hi @jishnub, I don’t think adding a pkill -9 julia command at the end of my PBS script would work, as PBS would still be held up by the julia main.jl command. However, I was able to force terminate the process by adding it to the end of my Julia script as follows:

rmprocs(workers())
run(`pkill -9 julia`)

I think this effectively resolves my issue. Many thanks to both @jmair and @jishnub for suggesting alternatives to try.
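
One caveat with pkill -9 julia is that it matches by process name, so it signals every Julia process you own on that node, not just this script. A narrower variant is sketched below, assuming a Unix node where the kill command is available. The helper names kill_self_cmd and hard_exit are hypothetical and not part of Distributed.jl.

```julia
using Distributed

# Hypothetical helper: build a command that targets only the current process ID.
kill_self_cmd() = `kill -9 $(getpid())`

# Hypothetical helper: remove the workers, flush buffered output, then
# force-terminate this process only. SIGKILL skips atexit hooks, so any
# output that has not been flushed would be lost.
function hard_exit()
    rmprocs(workers())
    flush(stdout)
    flush(stderr)
    run(kill_self_cmd())
end

# Call hard_exit() as the very last line of main.jl.
```

Because the process is ended by a signal, the julia main.jl line in the PBS script will report a nonzero exit status, which may matter if later commands in the job script check it.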

This sounds like a bug, though. The program shouldn’t require a force-terminate. Could you file an issue on the Julia repo? It would be good to have this looked into.