Hi, I am currently learning Julia and looking at the ecosystem to understand how to rewrite some existing code. I run my code on a cluster, so one thing I'm particularly curious about is whether Julia's features can simplify my current workflow.
What I’m currently doing (with Matlab code) is pretty laborious:
1. modify the code on my laptop and generate the list of jobs/parameters to launch;
2. run a script to rsync the code to a location accessible from the cluster's nodes;
3. ssh to the frontend (there is no direct access to the nodes);
4. from the frontend, launch an interactive session on a node and "compile" the code;
5. run a script to generate a wrapper that sets up the environment;
6. from the frontend, call a launcher script that submits a job for each desired set of parameters; since some jobs can crash for various reasons, the script checks the output files and the running jobs to detect which ones should be (re)launched;
7. wait for the jobs to be scheduled and run;
(6bis. realize there was a human error somewhere and go back to 1.)
8. run a script to fetch the results (CSV files) back to the laptop.
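For reference, here is roughly what the rsync step and the (re)submission check look like today, translated into a minimal Julia sketch driven from the laptop (untested; the host names, paths, and the `run_one.sh` sbatch script are all made up):

```julia
# Hypothetical sketch: drive code sync and job (re)submission from the laptop.
# "me@frontend", the paths, and run_one.sh are placeholders for my real setup.

function sync_code()
    # copy the code to a location visible from the cluster's nodes
    run(`rsync -az --delete src/ me@frontend:/shared/project/src/`)
end

function submit_missing(params)
    # (re)submit a job for every parameter set whose output file is missing
    for p in params
        outfile = "results/run_$(p).csv"
        if !isfile(outfile)
            run(`ssh me@frontend sbatch /shared/project/run_one.sh $(p)`)
        end
    end
end
```

This is just the status quo expressed in Julia, though; what I'm hoping for is a way to avoid most of this plumbing altogether.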
Now, I guess I could reproduce a similar workflow in Julia, but it looks like some Julia features/packages could simplify it. How would you approach this?
For instance, I read about ClusterManagers.jl, which I understand as an abstraction layer for launching jobs on various clusters (step 6.). Would it make sense / be possible to use it directly from the laptop and do everything from there? Are these features only useful when each job is distributed over multiple CPUs/cores, or do they also make sense when each job runs on a single CPU?
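To make the question concrete, here is what I imagine it might look like, pieced together from the ClusterManagers.jl README (untested; I'm assuming a SLURM cluster, and `model.jl`, `run_one`, and the parameter list are placeholders):

```julia
# Hypothetical sketch: request SLURM workers and farm out one single-core
# task per parameter set. Keyword arguments are passed through to srun.
using Distributed, ClusterManagers

addprocs_slurm(16; partition="batch", t="01:00:00")  # ask SLURM for 16 workers

@everywhere include("model.jl")   # make run_one available on every worker
params  = 1:100                   # one entry per job
results = pmap(run_one, params)   # each call runs on one worker / one core
rmprocs(workers())                # release the allocation
```

Is this the intended usage, and would it work when the Julia process doing the `addprocs_slurm` is on my laptop rather than on the frontend?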