Following my last question, I am now trying to set up DrWatson.jl for my routines on a cluster. I wonder if anyone has experience with this? Running a job currently involves the following steps:
- I have a Julia script, generate_run.jl, which creates a custom run.sh script that is submitted via sbatch. This Julia script also takes care of the parameters, the config file (config.json) for run.sh, and the save path. This will most probably be nearly fully replaced by DrWatson.
- run.sh contains an srun call (with parameters such as the number of cores and the RAM, which vary a lot depending on the function to be called), which in turn runs the corresponding Julia script, script.jl. In some calculations I run an array of tasks inside a job, where each task has an id that is passed as an additional parameter to script.jl, which then loads the corresponding entry from the config file. The script loads a library of mine that takes care of all the routines. At the end, I save everything using JLD, since the data is large and comes in my own types.
- The data is loaded afterwards using another Julia script.
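To make the setup concrete, here is a heavily simplified sketch of what generate_run.jl does today (all parameter names, resource flags, and paths are placeholders, not my real values):

```julia
# sketch of generate_run.jl: write a config file and an sbatch script
# (all names and values here are placeholders)

open("config.json", "w") do io
    print(io, """[{"beta": 0.5}, {"beta": 1.0}]""")  # one entry per array task
end

open("run.sh", "w") do io
    print(io, """
#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --array=1-2
srun julia script.jl config.json \$SLURM_ARRAY_TASK_ID
""")
end

# run(`sbatch run.sh`)  # finally, submit the job
```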
It would be nice if someone could share some code showing how they tackle this problem and resolve the specific issues, since some of the routines in DrWatson have to be modified to reproduce this behavior. In particular, I do not see how something like this array submission to the cluster can be handled.
With best regards,
You should take a look at
ClusterManagers.jl. It lets you avoid batch scripts entirely, so everything can be contained in Julia. I haven't really used DrWatson that much (I'm planning to use it for my next project), but here is my current workflow.
Let A be a module with a function
main() that we wish to execute in an embarrassingly parallel way. Then consider a run script,
run.jl, that runs
main() and processes its results. For me this script usually has the following format:
# import packages required to process data.
# can include DrWatson here.
using Distributed, ClusterManagers

# load your own module, which gives you access to A.main()

# use ClusterManagers to connect to Slurm; note that I give the exeflags
# so that the workers start in the right project environment.
addprocs(SlurmManager(500), N=17, topology=:master_worker, exeflags="--project=.")

# important! make sure A.main() is available on all the processors/workers.
@everywhere using A

results = pmap(1:number_of_sims) do x
    A.main()
end

# at this point, results will be an Array with number_of_sims elements,
# and each element will be the return value of main().
# process your results - take averages, write sim data to file, and so on.
# use DrWatson to manage your simulation as well.
Hi @affans, thanks for sharing your script. Could I ask exactly how you're telling Slurm to run
run.jl? Are you basically just running Julia from your sbatch script, i.e. something like the following?
#SBATCH blah blah blah
julia --project=@. <filename>
@affans I am running Julia code that does a Monte Carlo simulation on an LSF cluster.
I would like to improve my implementation and I am not sure how.
Currently, I have a script that knows how to run the simulation, and it writes the results to a file.
This script takes more than a day to run.
To get statistics, I run O(1000) jobs that all run the same script.
I do this using the job-array option, which simply runs many instances of the same script.
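For reference, inside each array task my script picks up its index from LSF's LSB_JOBINDEX environment variable; roughly like this (run_simulation here is just a stand-in for my actual simulation):

```julia
# each array task reads its index from the LSF environment
run_simulation(i) = i^2  # stand-in for the actual day-long simulation

idx = parse(Int, get(ENV, "LSB_JOBINDEX", "1"))  # defaults to 1 outside LSF
write("result_$(idx).txt", string(run_simulation(idx)))
```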
I was wondering how I can use your example in my case.
In the script you wrote above, I do not understand what actually runs the function
main(). I am also not sure where one should specify how many jobs to run.
Also, should this script run on the cluster? Do you simply execute it with julia run.jl?
Sorry for the late reply. I basically don't use Slurm from the command line at all. Using
ClusterManagers abstracts that away from me, and I simply use
pmap or other functions that distribute the work over the added processors.
I am not sure how to manage/deal with an LSF cluster; my entire experience is with either PBS or Slurm. However, it does seem like
ClusterManagers (JuliaParallel/ClusterManagers.jl on GitHub) can also work with LSF clusters. Here is a brief description of the commands needed to parallelize your Monte Carlo simulation.
Suppose you have a function that does the heavy computation. Let's call this function
work, defined in a file called main.jl:

function work()
    # do computationally expensive work
end

work() # run the function
Most often, people would just use
sbatch (I'm not sure what the equivalent is in LSF) to launch
n parallel processes (i.e. basically launching
julia main.jl across n workers). The alternative is to manage everything within Julia itself using ClusterManagers. All you have to do is launch Julia and load the correct packages.
First, you'd add worker processes using
addprocs. For LSF, the command seems to be

addprocs(LSFManager(np, bsub_flags, ssh_cmd, retry_delays, throttle))

where you'd have to investigate what those arguments are and how they are used within LSF. Once your worker processes are added (check with
nprocs()), you need a way to distribute your job. This is fairly easy using pmap:

results = pmap(x -> work(), 1:nsim)
Here, pmap will run the function
work over all the added processors; in particular, it will run it
nsim times. The result returned by each call to
work is stored in an array. I've skipped some details: for example, you want to make sure the function
work is available on the workers by using
@everywhere. Also keep in mind that
pmap runs on the master process (likely on the head node), which may have less RAM/disk than the compute nodes. And other considerations.
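Putting those pieces together, here is a minimal self-contained sketch: local workers stand in for the LSF workers, and a toy function stands in for the heavy computation.

```julia
using Distributed

addprocs(2)  # stand-in for addprocs(LSFManager(...)); two local workers for illustration

# @everywhere makes the function available on every worker process
@everywhere work(i) = sum(rand(1000)) / 1000  # toy stand-in for the heavy computation

nsim = 8
results = pmap(work, 1:nsim)  # runs work nsim times, one call per element of 1:nsim

length(results)  # an array with one entry per simulation
```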
Thanks for the answer!
I am still not sure what the benefit is of running
addprocs() rather than submitting many jobs to the cluster.
Does ClusterManagers.jl know how to handle the job submission, including all of the suspensions imposed by the cluster itself?
I only use it because it keeps everything contained within Julia. I use
pmap to distribute my work function, which then returns an array so I can do the postprocessing right away.
Yes, ClusterManagers should be able to handle all that, but I am not sure about LSF clusters and how they're implemented in ClusterManagers. The source code isn't that hard to read and may be useful.
Is there a problem with writing result files in this way?
I am trying to write files and I get the following error:
Error launching workers
ClusterManagers.LSFException("ls_rstat: File operation failed: No such file or directory\nJob <128585> may not be in RUN status. Use bjobs or bhist to confirm.\nJob <128585>: Batch library internal error\n")
That looks like an LSF error, so I can't help you with that; I've never really worked with an LSF cluster. But Google why you can't save your data. Maybe it's a permissions issue?