Using DrWatson.jl with Slurm: Best practice tips

Dear folks,

Following my last question, I am now trying to set up DrWatson.jl for my routines on a cluster, and I wonder whether someone has experience with this. When I run a job, the following things happen:

  1. I have a Julia script, generate_run.jl, which creates a custom run.sh script that is submitted via sbatch. This Julia script also takes care of the parameters, the config file (config.json) for run.sh, and the savepath. This part will most probably be replaced almost entirely by DrWatson.
  2. run.sh contains an srun call (with parameters like the number of cores and RAM, which vary a lot depending on the function that is to be called), which in turn runs the corresponding Julia script, script.jl. In some calculations I run an array of tasks inside one job; each task has an id, which is passed as an additional argument to script.jl, and the script then loads the corresponding entry of the config file (a rough sketch of this pattern is below the list). The script loads a library of mine that takes care of all the routines. In the end, I save everything with JLD, since the data is large and comes in my own types.
  3. The data is then loaded with another Julia script.
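
For concreteness, here is a rough, simplified sketch of what this chain currently looks like; all parameter names, paths, and resource flags are placeholders, not my real values:

# generate_run.jl -- simplified sketch (placeholder names and resources)
using JSON

# one entry per array task; script.jl later picks its entry by the task id
configs = vec([Dict("N" => N, "beta" => beta) for N in (100, 200), beta in (0.1, 0.5, 1.0)])
savepath = joinpath("data", "run_001")
mkpath(savepath)
open(joinpath(savepath, "config.json"), "w") do io
    JSON.print(io, configs)
end

# write run.sh with a job array; SLURM_ARRAY_TASK_ID selects the config entry
open("run.sh", "w") do io
    println(io, "#!/bin/bash")
    println(io, "#SBATCH --array=1-$(length(configs))")
    println(io, "#SBATCH --cpus-per-task=4")
    println(io, "#SBATCH --mem=8G")
    println(io, "srun julia --project=. script.jl $savepath \$SLURM_ARRAY_TASK_ID")
end
run(`sbatch run.sh`)

# script.jl -- each array task loads its own entry of the config
using JSON
savepath, id = ARGS[1], parse(Int, ARGS[2])
params = JSON.parsefile(joinpath(savepath, "config.json"))[id]
# ... hand params to my library, then save the result for this id with JLD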

It would be nice if someone could share some code showing how they tackle this and how they resolve the specific issues, since some of the routines in DrWatson have to be modified in order to reproduce this behavior. In particular, I do not see how something like this array submission to the cluster can be handled.
With best regards,
v.


You should take a look at ClusterManagers.jl. It lets you avoid batch scripts and keep everything contained in Julia. I haven't really used DrWatson that much (I'm planning to use it for my next project), but here is what my current workflow looks like.

Let A be a module with a function main() that we wish to execute in an embarrassingly parallel way. Then consider a run script run.jl that will run main() and process its results. For me this script usually has the following format:

# import packages required to process data.
# can include DrWatson here.
using Distributed
using DataFrames
using Statistics
using UnicodePlots
using ClusterManagers
using DelimitedFiles

# load your own module, which gives you access to A.main()
using A 

# use ClusterManagers to connect to slurm, and note that I give the exeflag so that I am in the right environment. 
addprocs(SlurmManager(500), N=17, topology=:master_worker, exeflags="--project=.")

# important! make sure A.main() is available on all the processors/workers. 
@everywhere using A 

function run()
    results = pmap(1:number_of_sims) do x
        A.main()
    end
    # at this point, results will be an Array with number_of_sims elements,
    # and each element will be the return value of main().
    # process your results - take averages, write sim data to file, and so on.
    # use DrWatson to manage your simulation as well.
end

Hi @affans, thanks for sharing your script. Could I ask exactly how you're telling Slurm to run run.jl? Are you basically just running Julia from your sbatch script, i.e. something like the following?

#SBATCH blah blah blah

julia --project=@. <filename>

@affans I am running Julia code that does a Monte Carlo simulation on an LSF cluster.
I would like to improve my implementation and I am not sure how.

Currently, I have a script that knows how to run the simulation, and it writes the results to a file.
This script takes more than a day to run.
To get statistics, I run O(1000) jobs that run the same script.
I do this using the job-array option, which simply runs many instances of the same script (roughly sketched below).
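
In case it matters, each instance can tell which array element it is from LSF's LSB_JOBINDEX environment variable; a minimal sketch (the script name and the seeding are placeholders, not my actual code):

# simulation.jl -- one instance of the job array
using Random
idx = parse(Int, ENV["LSB_JOBINDEX"])   # 1..1000 for a job array submitted as "sim[1-1000]"
Random.seed!(idx)                       # give each instance its own seed
# ... run the Monte Carlo and write this instance's results to, e.g., results_$idx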

I was wondering how can I use your example in my case.
In the script you wrote above, I do not understand what actually calls the function run(). I am also not sure where one should specify how many jobs to run.
Also, should this script run on the cluster? Do you simply execute it with julia run.jl?

Sorry for the late reply. I basically don't use Slurm from the command line at all. ClusterManagers abstracts that away from me, and I simply use addprocs together with pmap or other functions that distribute the work over the added processes.

I am not sure how to manage/deal with an LSF cluster; my entire experience is with PBS or Slurm. However, it does seem like ClusterManagers (JuliaParallel/ClusterManagers.jl (github.com)) can also work with LSF clusters. Here is a brief description of the commands needed to parallelize your Monte Carlo simulation.

Suppose you have a function that does the heavy computation. Let's call this function work, defined in a file called main.jl, i.e.

function work()
    # computationally expensive work goes here (placeholder computation)
    xx = sum(rand(10^7))
    return xx
end

work() # run the function

Most often, people would just use sbatch (not sure what the equivalent is in LSF) to launch n parallel processes, i.e. basically launching julia main.jl across n workers. The alternative is to manage everything within Julia itself using ClusterManagers. All you have to do is launch Julia and load the right packages.

First you’d add worker processes using addprocs. For LSF, the command seems to be

addprocs(LSFManager(np, bsub_flags, ssh_cmd, retry_delays, throttle))

where you'd have to investigate what those arguments are and how they are used within LSF. Once your worker processes are added (check with nprocs()), you need a way to distribute your job. This is fairly easy using pmap, i.e.

pmap(work, 1:nsims) 

Here, pmap will run the function work over all the added processes; in particular, it will run it nsims times, and the results of the individual work calls are collected in an array. I've skipped some details: for example, you want to make sure the function work is available on the workers by using @everywhere, and since pmap passes each element of 1:nsims as an argument, work has to accept (or ignore) that argument. Also keep in mind that pmap itself runs on the master process (likely on the head node), which may have less RAM/disk than the compute nodes. And there are other considerations.
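
To put the pieces together, a minimal end-to-end sketch might look like the following; the worker count, file name, and nsims are placeholders, and you'd swap the plain addprocs call for the LSFManager/SlurmManager call that matches your cluster (check the ClusterManagers README for the exact arguments):

using Distributed

# on the cluster, replace this with addprocs(LSFManager(...)) as above;
# for a quick local test, plain addprocs also works
addprocs(4; exeflags="--project=.")

# make work() available on every worker (main.jl should only define it,
# not call it at the top level)
@everywhere include("main.jl")

nsims = 1000
# pmap passes the index to the function, so wrap work() to accept and ignore it
results = pmap(i -> work(), 1:nsims)

# results is a Vector with nsims entries, one return value of work() per run
println(length(results))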


Thanks for the answer!

I am still not sure what the benefit of running addprocs() is over submitting many jobs to the cluster.
Also, does ClusterManagers.jl know how to handle the job submission, including all of the suspensions imposed by the cluster itself?

I only use it because it keeps everything contained within Julia. I use pmap to distribute my work function, which then returns an array, and I can do the postprocessing right away.

Yes, ClusterManagers should be able to handle all that, but I am not sure about LSF clusters and how they are handled in ClusterManagers. The source code isn't that hard to read and may be useful.


Is there a problem with writing result files in this way?

I am trying to write files and I get the following error:

Error launching workers
ClusterManagers.LSFException("ls_rstat: File operation failed: No such file or directory\nJob <128585[11]> may not be in RUN status. Use bjobs or bhist to confirm.\nJob <128585[11]>: Batch library internal error\n") 

That looks like an LSF error, so I can't help you with that; I've never really worked with an LSF cluster. But Google why you can't save your data. Maybe a permissions issue?
