HPC simulations with DrWatson (continued from Slack)

We can continue the discussion here so it doesn’t get lost in Slack.

Slack messages

Rachel Kurchin July 31 at 6:52 PM
Hi all, I had DMed @Datseris about this and he suggested I post here; I guess some other package maintainers are here and use HPC?
I’m really intrigued by the DrWatson.jl package and loved the JuliaCon talk! Been looking more into it and trying to figure out how I can best incorporate it in some of my HPC Julia workflows. Do you have any examples of integrating it with a workload manager like Slurm, or thoughts on the best approach for such a thing? I’m not so concerned about it automatically submitting the jobs, as I’m happy to do that myself, but a constant workflow headache for me is managing the error/log files produced, and it seems this package could be a really good solution for making sure those get attached to the correct versions of Julia scripts, etc.
I’ve been chewing for a while on the idea of writing some scripts to automatically give smarter names to error/output files from Slurm jobs than my current default, which is just error. and output., and I think there’s potential here for something really cool and integrated… So often right now I end up deleting those files for failed jobs because they clutter up folders, when really I should save them with sensible names and in sensible places…
Anyway, would love your thoughts. On a related note, I also do a fair bit of Python stuff on HPC systems so if some Slurm-related aspects could be adapted towards that too I might also be interested in helping to build some of this functionality if a template doesn’t already exist…
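A minimal sketch of the log-naming idea above, written in Python since that is also in use here. It only builds the text of a batch script; the function name, the `logs` directory, the example command and the commit string are all hypothetical. The `--job-name`, `--output` and `--error` options and the `%x` (job name) and `%j` (job ID) filename patterns are standard sbatch features. As far as I know, Slurm does not create the log directory itself, so it has to exist before submission.

```python
from pathlib import PurePosixPath


def sbatch_script(job_name: str, command: str,
                  log_dir: str = "logs", commit: str = "unknown") -> str:
    """Return the text of a Slurm batch script with descriptive log file names.

    `commit` is any short label (for example a git hash) that ties the
    log files to a specific version of the code. All names are illustrative.
    """
    # Slurm expands %x to the job name and %j to the job ID at run time.
    prefix = PurePosixPath(log_dir) / f"%x_{commit}_%j"
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --output={prefix}.out",
        f"#SBATCH --error={prefix}.err",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: the script path and commit hash are placeholders.
script = sbatch_script("diffusion_T300", "julia --project scripts/run.jl",
                       commit="a1b2c3d")
print(script)
```

Writing `script` to a file and passing it to `sbatch` would then produce log files such as `logs/diffusion_T300_a1b2c3d_<jobid>.out`, which makes it easier to keep the output of failed jobs instead of deleting it.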
5 replies

George Datseris :juliapool: 3 days ago
@sebastianpech, @Jonas Isensee @tamasgal I think what you guys do counts as HPC!

Sebastian Pech 17 hours ago
@rkurchin I think what you’re describing is more of a general issue with Slurm etc., but I agree with you that DrWatson has the potential for simplifying those workflows.
I’ve been working on an extension of DrWatson (https://github.com/sebastianpech/DrWatsonSim.jl) which aims at exactly that. At the moment I’m running my simulations mostly on a single machine with tmux sessions, but the launch script can of course be adapted to use submit commands.
Basically the package consists of two parts: storing metadata for arbitrary files or folders, and a method to “submit” jobs and generate internal simulation ids for keeping track. The readme covers both parts. I’ve discussed the package with @Datseris and @Jonas Isensee and we concluded that the submit part is still a bit too specific in its current implementation; however, I don’t have a better solution atm for a workflow that is as simple and non-intrusive as possible.
Currently, running a simulation automatically creates a folder with a unique id (an incrementing counter) and uses that folder as the working directory for the simulation process. Besides creating the folder, a metadata entry is attached to it containing the parameters used, the environment, git commit, git patch and basically everything else you want to attach. Querying simulations then works through the metadata interface. For example, what I often do is submit some similar simulations that I want to compare. So I query for one of the simulation ids I know (i.e. I load the metadata for that folder), extract from the field simulation_submit_group all other metadata entries that were submitted together, and then group by one parameter in the field parameters to get two subplots containing the simulation results. This works very well with https://github.com/queryverse/Query.jl.
If you have any questions don’t hesitate to ask. I’m pretty sure I’m the only one using it at the moment, so the readme might not explain every detail of the package very well.
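To make the folder-plus-metadata workflow described above concrete, here is a rough sketch in Python. It is not DrWatsonSim's API: the function names, the `metadata.json` file and the example parameters are made up for illustration, and only the field names simulation_submit_group and parameters come from the description. The counter is also not safe against two jobs being created at the same moment.

```python
import json
import tempfile
from pathlib import Path


def new_simulation(root: Path, parameters: dict, submit_group: str) -> Path:
    """Create the next numbered simulation folder and attach metadata to it."""
    # Incrementing counter: one more than the highest existing numeric folder.
    existing = [int(p.name) for p in root.iterdir()
                if p.is_dir() and p.name.isdigit()]
    sim_id = max(existing, default=0) + 1
    folder = root / str(sim_id)
    folder.mkdir()
    metadata = {
        "id": sim_id,
        "parameters": parameters,
        "simulation_submit_group": submit_group,
    }
    (folder / "metadata.json").write_text(json.dumps(metadata))
    return folder


def load_metadata(folder: Path) -> dict:
    """Read the metadata entry attached to a simulation folder."""
    return json.loads((folder / "metadata.json").read_text())


def same_group(root: Path, sim_id: int) -> list:
    """Return metadata of all simulations submitted together with `sim_id`."""
    group = load_metadata(root / str(sim_id))["simulation_submit_group"]
    entries = [load_metadata(p) for p in root.iterdir()
               if (p / "metadata.json").exists()]
    matching = [m for m in entries if m["simulation_submit_group"] == group]
    return sorted(matching, key=lambda m: m["id"])


def group_by_parameter(entries: list, name: str) -> dict:
    """Map each value of one parameter to the ids of the runs that used it."""
    grouped = {}
    for m in entries:
        grouped.setdefault(m["parameters"][name], []).append(m["id"])
    return grouped


# Hypothetical runs: two submitted together, one submitted separately.
root = Path(tempfile.mkdtemp())
new_simulation(root, {"mesh": "coarse", "load": 1.0}, "batch-a")
new_simulation(root, {"mesh": "fine", "load": 1.0}, "batch-a")
new_simulation(root, {"mesh": "coarse", "load": 2.0}, "batch-b")

grouped = group_by_parameter(same_group(root, 1), "mesh")
print(grouped)
```

Each key of `grouped` would correspond to one subplot in the comparison described above. A real setup would also record the environment and git state in the metadata, which is left out here.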

Przemyslaw Szufel 14 hours ago
@rkurchin I have been experimenting with writing my own schedule manager for HPC that runs on AWS and integrates with Julia. It might or might not be useful for you, but I did it all in bash; here are all my ideas: https://github.com/pszufe/KissCluster

George Datseris :juliapool: 11 hours ago
just chiming in to say that everything you write here will be deleted in about a week at best

George Datseris :juliapool: 11 hours ago
_perhaps save this convo somewhere :slight_smile:_

Hmmm… I work in HPC and know Slurm quite well. I am very happy to participate and run tests if I can.

However… this seems to me like a discussion about modern Machine Learning workflows. To be very honest and perhaps provocative, this is something the traditional HPC community has not been paying attention to.
Over in Machine Learning land, Kubernetes is of course extremely popular. On top of that, we see utilities for tracking and managing models, such as these and many more:

Seldon https://www.seldon.io/
Polyaxon https://polyaxon.com/

I really do think the traditional HPC folks with Slurm schedulers can learn something from this community. I include myself there.
I wonder if there is already some integration between an ML workflow package and Slurm.

I am currently on holidays but I bookmarked this thread, thanks for that!

As an HPC user, I’d really welcome some Julia package to steer batch job submission and monitoring and I think we could squeeze it into DrWatson easily.

I myself use a small Python script which is customised to our experiment, where you mostly process a huge number of run files and generate result files (e.g. thousands of ROOT files stored on tapes, with file paths/names following a predefined scheme, and an analysis script which produces a single HDF5 file). It’s tailored to the batch system SGE, which is used in many of our clusters. I have not used a Slurm system yet…


There are workflow managers for HPC, particularly coming out of the ‘Exascale’ materials modelling projects which have been running for the best part of a decade.

Two of the most established are FireWorks and AiiDA.

I think it would be good to try and leverage this work, particularly all the software engineering about getting data in and out, and writing, submitting and monitoring jobs on various styles of HPC and queue systems.