Running independent variable-length simulations on a busy Slurm cluster

I’m running Julia code on a Slurm cluster, but this question is not entirely specific to Julia. I hope it’s OK to post it here anyway (I am also fine with a Julia-specific solution):


Suppose we have a large number of calculations to run (e.g. a sweep over many different parameters of a single function), with these characteristics:

  • there is considerable inherent variance in runtime of the calculations (some might take a few seconds, some a few hours, depending on the parameters)
  • the calculations are independent (the underlying code/function is the same, but the input arguments vary)
  • the nodes of the available Slurm cluster are fairly heterogeneous in terms of hardware and the cluster is usually quite busy

What would be a good strategy to run these calculations in a short time and without wasting too many resources?

I will detail my current strategy and reasoning for it below and discuss some pros/cons. Any comments and pointers towards improvement are greatly appreciated :slight_smile:

Current Solution

I am currently using a kind of “producer-consumer” pattern (I’m not a professional software developer, so I’m not sure the terminology is correct), where I essentially

  1. define a list of all parameters to run (this is a file on a shared directory)
  2. run several independent Slurm jobs (a job array), where each job looks into the file, picks the next parameter set to work on, produces its output, and repeats, until either no more parameters are left to be run or the job reaches a time limit
  3. start a cleanup job upon completion of the job array, which merges the outputs of the worker jobs
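
A minimal sketch of steps 1 and 2, to make the pattern concrete. Instead of locking a shared parameter file, it claims each parameter set by creating a marker directory, on the assumption that `mkdir` fails for every worker except the first one on the shared file system (worth verifying for the cluster's file system). The names `try_claim`, `run_worker`, `claimdir`, `outdir`, and `deadline` are made up for this illustration and are not part of any package.

```julia
# Hypothetical helper: claim task `i` by creating a marker directory.
# Returns true if this worker created the marker, false if it already existed.
function try_claim(claimdir::AbstractString, i::Integer)
    marker = joinpath(claimdir, "task_$i")
    try
        mkdir(marker)                  # throws if the marker already exists
        return true
    catch
        isdir(marker) && return false  # another worker claimed it first
        rethrow()                      # some other failure, e.g. permissions
    end
end

# Hypothetical worker loop: walk the parameter list, claim what is still free,
# run `f` on it, and stop once `deadline` (a Unix timestamp) has passed.
function run_worker(f, params, claimdir, outdir; deadline = Inf)
    done = Int[]
    for (i, p) in enumerate(params)
        time() > deadline && break          # leave before Slurm kills the job
        try_claim(claimdir, i) || continue  # skip sets claimed by other workers
        result = f(p)
        open(joinpath(outdir, "result_$i.txt"), "w") do io
            println(io, result)
        end
        push!(done, i)
    end
    return done
end
```

Each job in the array would call `run_worker` once after its expensive setup, with `deadline` set somewhat before the job's time limit. One limitation of this sketch: a parameter set claimed by a worker that crashes stays claimed, so failed runs need separate handling.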

In my experience, this strikes a good balance between starting the computation quickly (there is usually room for a couple worker jobs on the cluster at any given point in time), and averaging out the occasional slow parameter sets among many quick ones.

I am aware of ClusterManagers.jl, but I want to avoid one big resource allocation with many parallel tasks because, in my experience, it takes a long time to get all the resources allocated at once on a busy cluster. To my understanding, the package is not really meant for getting the allocation, hence job arrays are not really within its scope?

Is anyone here facing similar issues and using different workflows? My current solution is a bit brittle in places, so I am thinking of streamlining it and wrapping it into a package for convenience. E.g. it’s a bit fiddly to make the worker processes communicate via the parameter list file (this involves locking) and to get a good overview of the progress of the computation as a whole.

So it would be great to hear about alternatives. Maybe I’m just misunderstanding something about the way to use the tools above.


I can’t claim that there is a single mechanism that you can just plug into and get what you want, and your approach certainly sounds reasonable, but I think there are good options for adjusting your workflow to make this easier. In particular, I’ve had very good results with DrWatson. In general, I think it’s full of great ideas that could really improve a lot of processes for many of us. But it can seem daunting at first, so I’ll highlight a few key ideas that I think are most relevant for you.

First is the idea of savename, which constructs a file name from key-value pairs of the relevant parameters to your main data-creation function, so that the file name fully describes what data are in your file. This name can also be parsed automatically. Note that such file names can get crazy long if you use lots of parameters in the name, so there’s also an option to use hashes instead, but that means you can’t really invert the name to a precise set of parameters; it just lets you know if that set has been run.

Next, DrWatson has nice functions to save your data, optionally with enough git info to actually be able to reproduce the exact code used to do so, and optionally even including the local path of the script and the line number where the save function was called.

Now, your notion (1.) of defining a list of all parameters to run is described with the dict_list function, and saving it to a shared directory with tmpsave.
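
To illustrate the idea of expanding a parameter description into a list of parameter sets, here is a rough sketch in plain Julia. It is not DrWatson's implementation, and `expand_grid` is a made-up name: every vector-valued entry is treated as a range to sweep over, and every other entry is held fixed.

```julia
# Hypothetical helper: turn a Dict whose values may be vectors into a vector
# of Dicts, one for each combination of the vector-valued entries.
function expand_grid(grid::AbstractDict)
    ks = collect(keys(grid))
    # Wrap scalar values so that every entry can be iterated over.
    vals = [v isa AbstractVector ? v : [v] for v in values(grid)]
    return vec([Dict(zip(ks, combo)) for combo in Iterators.product(vals...)])
end

grid = Dict(:a => [1, 2], :b => [0.1, 0.2, 0.3], :model => "linear")
param_sets = expand_grid(grid)   # 2 * 3 = 6 parameter sets
```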

Then, you can put those ideas together with the produce_or_load function, which takes your main data-creation function and a set of parameters that you would pass to it, checks to see if the file exists (because it would be named according to those parameters), and either loads the data if it exists or runs the simulation.
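
The logic can be sketched in plain Julia as follows. This is only an illustration of the idea, not DrWatson's API: `param_filename` and `load_or_produce` are made-up names, and the file naming is a simplified stand-in for what savename does.

```julia
using Serialization

# Hypothetical stand-in for a parameter-derived file name:
# sorted "key=value" pairs joined by underscores.
function param_filename(params::AbstractDict; suffix = "jls")
    parts = ["$(k)=$(params[k])" for k in sort!(collect(keys(params)))]
    return join(parts, "_") * "." * suffix
end

# Load the result if its file exists, otherwise run `f(params)` and save it.
# Returns the data and the path of the file.
function load_or_produce(f, params::AbstractDict, dir::AbstractString)
    path = joinpath(dir, param_filename(params))
    isfile(path) && return deserialize(path), path
    data = f(params)
    mkpath(dir)
    # Write to a temporary file first so that a half-written result
    # is never mistaken for a finished one.
    tmp = tempname(dir)
    serialize(tmp, data)
    mv(tmp, path; force = true)
    return data, path
end
```

Because finished results are detected by file name, rerunning the whole sweep after a failure only recomputes the parameter sets whose files are missing.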

Now, this is all lovely if your main function will run relatively quickly. But if each call should really be a separate submission on your cluster, it’s not ideal. So your item (2.) is discussed here. You could combine this with the logic of the produce_or_load function, to make it check if the file exists or submit a job to produce it, if you think you might need to add some new runs with different parameters, or if you worry that some runs may fail and need to be rerun.

Finally, once everything is done, your item (3.) is discussed as part of the collect_results! function.
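
For a rough idea of what such a merge step does, here is a sketch in plain Julia, assuming each run saved one file with the Serialization standard library into a common directory. `merge_results` is a made-up name, and this is not how collect_results! is implemented.

```julia
using Serialization

# Hypothetical merge step: read every result file in `dir` and return
# a vector of (file, data) named tuples, sorted by file name.
function merge_results(dir::AbstractString; suffix = ".jls")
    files = sort!(filter(f -> endswith(f, suffix), readdir(dir)))
    return [(file = f, data = deserialize(joinpath(dir, f))) for f in files]
end
```

Comparing `length(merge_results(dir))` with the number of parameter sets also gives a crude progress overview while the workers are still running.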

I don’t see any great reason to run these as a job array. I have used them in the past, but only for simple parameter ranges. In this case, Julia will handle deciding on the parameters for you, so Slurm doesn’t really need to know they’re related. It can make it easier to monitor your progress, though, since all the job IDs start with the same number.

In the past, I’ve also tried approaches involving having an actual database file that each job looks up to find the next set of parameters. Databases have solved the problem of locks, so this is easy to do. But you still have to do a lot to ensure that failed runs get rerun, etc. I even once used a script running on the head node (it was very low usage, so nobody cared) that started, monitored, and restarted jobs if necessary. But that was an even bigger nightmare. I really think that generically, the simple approach you’ve described and DrWatson helps with is the best.


Thanks for taking the time to write such an extensive answer, I appreciate it!

I’ve heard a lot of good things about DrWatson.jl and it’s definitely on my list of things to check out, although I’ll need a while to follow through how my current workflow would exactly translate into using DrWatson.

My remaining question is what the actual implementation of the produce part of the produce_or_load function would look like. From my current understanding, it would have to use the same setup that I already have (job arrays) and avoid starting a single job for each individual parameter set, for the reasons mentioned above.

I probably should have added that there is also significant startup time involved in setting up the simulation for a single parameter set, but once everything is set up, running a new parameter set in the same Julia session can skip this compilation, etc. So for 1000s of individual parameter sets to sweep over, I would be taxing the Slurm scheduler unnecessarily and wasting a lot of total computation time if I started a single job for each one. That’s another reason why I went for individual worker jobs instead of one-job-per-parameter or one-job-total solutions.