I’m running Julia code on a Slurm cluster, but this question is not entirely specific to Julia. I hope it’s OK to post it here anyway (I am also fine with a Julia-specific solution):
Problem
Suppose we have a large number of calculations to be done (e.g. a sweep over many different parameter sets of a single function), with these characteristics:
- there is considerable inherent variance in the runtime of the calculations (some might take a few seconds, others a few hours, depending on the parameters)
- the calculations are independent (the underlying code/function is the same, but the input arguments vary)
- the nodes of the available Slurm cluster are fairly heterogeneous in terms of hardware, and the cluster is usually quite busy
What would be a good strategy for running these calculations in a short time without wasting too many resources?
I will detail my current strategy and the reasoning behind it below and discuss some pros and cons. Any comments and pointers towards improvement are greatly appreciated.
Current Solution
I am currently using a kind of “producer-consumer” pattern (I am not a professional software developer, so I am not sure if the terminology is correct) where I essentially
- define a list of all parameter sets to run (this is a file in a shared directory)
- run several independent Slurm jobs (a job array), where each job looks into the file, picks the next parameter set to work on, produces its output, and repeats until either no parameter sets are left or the job reaches its time limit (a stripped-down sketch of this worker loop follows below the list)
- start a cleanup job upon completion of the job array, which merges the outputs of the worker jobs
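For concreteness, here is a stripped-down sketch of what each worker job in the array does. The file names, the mkdir-based lock, and `run_calculation` are simplified placeholders rather than my exact script:

```julia
using Dates

# Hypothetical file names and time budget; adjust to the actual setup.
const PARAM_FILE = "params.txt"          # shared list, one parameter set per line
const LOCK_DIR   = PARAM_FILE * ".lock"  # mkdir-based lock, atomic on most shared filesystems
const DEADLINE   = now() + Hour(23)      # stay below the Slurm job's time limit

# Placeholder for the real computation being swept.
run_calculation(params::AbstractString) = sum(abs2, parse.(Float64, split(params)))

function try_lock()
    try
        mkdir(LOCK_DIR)                  # succeeds for exactly one worker at a time
        return true
    catch
        return false
    end
end

# Atomically pop the first remaining line of the parameter file, or return `nothing`.
function claim_next_params()
    while !try_lock()
        sleep(0.5 + rand())              # back off while another worker holds the lock
    end
    try
        lines = isfile(PARAM_FILE) ? readlines(PARAM_FILE) : String[]
        isempty(lines) && return nothing
        next = popfirst!(lines)
        write(PARAM_FILE, join(lines, '\n'))
        return next
    finally
        rm(LOCK_DIR)                     # always release the lock
    end
end

# Main worker loop: run until the list is empty or the time limit approaches.
outfile = "results_task_$(get(ENV, "SLURM_ARRAY_TASK_ID", "local")).txt"
while now() < DEADLINE
    params = claim_next_params()
    params === nothing && break
    result = run_calculation(params)
    open(outfile, "a") do io
        println(io, params, " => ", result)
    end
end
```

The cleanup job then only needs to merge the per-task output files.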
In my experience, this strikes a good balance between starting the computation quickly (there is usually room for a couple of worker jobs on the cluster at any given point in time) and averaging out the occasional slow parameter sets among many quick ones.
I am aware of ClusterManagers.jl, but I want to avoid one big resource allocation with many parallel tasks, because in my experience it takes a long time to actually get all the resources allocated at once on a busy cluster. To my understanding, the package is not really meant for obtaining the allocation itself, so job arrays are not really within its scope?
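For reference, my understanding is that the ClusterManagers.jl route looks roughly like this (the worker count, partition, and walltime are placeholders, and the exact keyword arguments may differ between package versions):

```julia
using Distributed, ClusterManagers

# Request all workers in one allocation up front; on a busy cluster this call
# blocks until Slurm can actually start every task.
addprocs(SlurmManager(64), partition="normal", t="24:00:00")

@everywhere run_calculation(p) = p^2     # placeholder for the real function

params = 1:1000
results = pmap(run_calculation, params)  # pmap balances the uneven runtimes dynamically
```

The dynamic scheduling of `pmap` would handle the runtime variance nicely, but only once the whole allocation has actually started.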
Is anyone here facing similar issues and using different workflows? My current solution is a bit brittle in places, so I am thinking of streamlining it and wrapping it into a package for convenience. For example, it is a bit fiddly to make the worker processes communicate via the parameter list file (this involves locking), or to get a good overview of the progress of the computation as a whole.
So it would be great to hear about alternatives. Maybe I’m just misunderstanding something about the way to use the tools above.