We can continue the discussion here so it doesn’t get lost in Slack.
Slack messages
Rachel Kurchin July 31 at 18:52
Hi all, I had DMed @Datseris about this and he suggested I post here; I guess some other package maintainers are here and use HPC?
I’m really intrigued by the DrWatson.jl package and loved the JuliaCon talk! I’ve been looking into it more and trying to figure out how I can best incorporate it into some of my HPC Julia workflows. Do you have any examples of integrating it with a workload manager like Slurm, or thoughts on the best approach for such a thing? I’m not so concerned about it automatically submitting the jobs, as I’m happy to do that myself, but a constant workflow headache for me is managing the error/log files produced, and it seems this package could potentially be a really good solution for making sure those get attached to the correct versions of Julia scripts, etc.
I’ve been chewing for a while on the idea of writing some scripts to automatically give smarter names to error/output files from Slurm jobs than my current default, which is just error. and output., and I think there’s potential here for something really cool and integrated… So often right now I end up deleting those files for failed jobs because they clutter up folders, when really I should save them with sensible names and in sensible places…
Anyway, would love your thoughts. On a related note, I also do a fair bit of Python stuff on HPC systems so if some Slurm-related aspects could be adapted towards that too I might also be interested in helping to build some of this functionality if a template doesn’t already exist…
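A minimal sketch of the "smarter names" idea above, written in Python since the thread mentions Python HPC work as well. It builds a descriptive log-file stem from the job's parameters and a git commit, loosely in the spirit of DrWatson's `savename`, and passes it to `sbatch` through the standard `--output` and `--error` options. The helper names (`log_stem`, `sbatch_command`), the `logs` directory, the script name, and the commit string are all hypothetical; this is not part of DrWatson or Slurm.

```python
import shlex


def log_stem(job_name, params):
    """Build a stem such as 'relax_T=300_n=64' from parameters sorted by key."""
    parts = [f"{key}={params[key]}" for key in sorted(params)]
    return "_".join([job_name] + parts)


def sbatch_command(script, job_name, params, commit, log_dir="logs"):
    """Return an sbatch argument list whose log files identify the run."""
    stem = f"{log_stem(job_name, params)}_{commit}"
    return [
        "sbatch",
        f"--job-name={job_name}",
        # Slurm replaces %j with the job ID, so reruns do not overwrite old logs.
        f"--output={log_dir}/{stem}_%j.out",
        f"--error={log_dir}/{stem}_%j.err",
        script,
    ]


# Hypothetical job: parameters and commit hash are made up for illustration.
cmd = sbatch_command("run.sh", "relax", {"n": 64, "T": 300}, commit="a1b2c3d")
print(shlex.join(cmd))  # a command line that can be pasted into a shell
```

The log directory should exist before submission, and the commit string would normally come from something like `git rev-parse --short HEAD` rather than being typed by hand.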
5 replies
George Datseris :juliapool: 3 days ago
@sebastianpech, @Jonas Isensee @tamasgal I think what you guys do counts as HPC!
Sebastian Pech 17 hours ago
@rkurchin I think what you’re describing is more of a general issue with Slurm etc., but I agree with you that DrWatson has the potential to simplify those workflows.
I’ve been working on an extension of DrWatson (sebastianpech/DrWatsonSim.jl on GitHub) which aims at exactly that. At the moment I’m running my simulations mostly on a single machine with tmux sessions, but the launch script can of course be adapted to use submit commands.
Basically, the package consists of two parts: storing metadata for arbitrary files or folders, and a method to “submit” jobs and generate internal simulation IDs for keeping track. The readme covers both parts. I’ve discussed the package with @Datseris and @Jonas Isensee and we concluded that the submit part is still a bit too specific in its current implementation; however, I don’t have a better solution atm for a workflow that is as simple and non-intrusive as possible.
Currently, running a simulation automatically creates a folder with a unique ID (an incrementing counter) and uses that folder as the working directory for the simulation process. Besides creating the folder, a metadata entry is attached to it containing info about the parameters used, the environment, the git commit, the git patch, and basically anything else you want to attach. Querying simulations then works through the metadata interface. For example, what I often do is submit some similar simulations that I want to compare. So I query for one of the simulation IDs I know (e.g. I load the metadata for that folder), extract from the field simulation_submit_group all other metadata entries that were submitted together, and then group by one parameter in the field parameters to get two subplots containing the simulation results. This works very well with Query.jl (queryverse/Query.jl on GitHub).
If you have any questions don’t hesitate to ask. I’m pretty sure I’m the only one using it at the moment, so the readme might not explain every detail of the package very well.
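The folder-per-simulation workflow described above can be sketched roughly as follows. This is a language-agnostic illustration in Python, not DrWatsonSim's API or storage format: the `metadata.json` file and every function name are assumptions, and only the field names `parameters` and `simulation_submit_group` come from the description. Environment and git information are left out for brevity, and the counter is not safe against several processes submitting at the same time.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def new_simulation(root, parameters, submit_group):
    """Create the next numbered folder and attach a metadata file to it."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    existing = [int(p.name) for p in root.iterdir() if p.is_dir() and p.name.isdigit()]
    sim_id = max(existing, default=0) + 1  # incrementing counter
    folder = root / str(sim_id)
    folder.mkdir()
    meta = {
        "id": sim_id,
        "parameters": parameters,
        "simulation_submit_group": submit_group,
    }
    (folder / "metadata.json").write_text(json.dumps(meta))
    return sim_id


def load_metadata(root, sim_id):
    """Read the metadata entry attached to one simulation folder."""
    return json.loads((Path(root) / str(sim_id) / "metadata.json").read_text())


def same_group(root, sim_id):
    """Return all metadata entries submitted together with sim_id, ordered by ID."""
    group = load_metadata(root, sim_id)["simulation_submit_group"]
    entries = [json.loads(p.read_text()) for p in Path(root).glob("*/metadata.json")]
    matching = [e for e in entries if e["simulation_submit_group"] == group]
    return sorted(matching, key=lambda e: e["id"])


def group_by_parameter(entries, name):
    """Map each value of one parameter to the simulation IDs that used it."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["parameters"][name]].append(entry["id"])
    return dict(groups)


# Made-up example: four runs submitted together, plus one unrelated run.
root = tempfile.mkdtemp()
for solver in ("euler", "rk4"):
    for dt in (0.1, 0.01):
        new_simulation(root, {"solver": solver, "dt": dt}, submit_group="batch-1")
new_simulation(root, {"solver": "euler", "dt": 1.0}, submit_group="batch-2")

# Start from one known ID, collect its submit group, then split by solver.
print(group_by_parameter(same_group(root, 1), "solver"))
# → {'euler': [1, 2], 'rk4': [3, 4]}
```

Each key of the resulting dictionary would correspond to one subplot in the comparison described above.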
Przemyslaw Szufel 14 hours ago
@rkurchin I have been experimenting with writing my own job scheduler for HPC that runs on AWS and integrates with Julia. It might or might not be useful for you, but I did it all in bash. Here are all my ideas: pszufe/KissCluster on GitHub ("The simplest cluster computing solution for the cloud, supports Python, R, Julia, Java, NetLogo, bash and everything else").
George Datseris :juliapool: 11 hours ago
just chiming in to say that everything you write here will be deleted in about a week at best
George Datseris :juliapool: 11 hours ago
_perhaps save this convo somewhere_ :slightly_smiling_face: