Hey Julia Community,
I’m currently using HTCondor to run batches of jobs on my university’s computing cluster. The problem I’m working on is “embarrassingly” (a.k.a. “pleasingly”) parallel, in that each job does the same task (i.e., runs a Julia script), but with a different argument.
I was wondering if anyone else is using HTCondor and has developed a workflow that works well for them. My current workflow is as follows:
- I used PackageCompiler.jl to build a system image containing the packages I use. This is stored as “CHSysimage.so.gz” (note that it is gzip-compressed).
- My HTCondor submit file (“batch.job”) transfers this system image to each node and runs the executable/shell script “batch.sh”.
- The shell script (“batch.sh”) unzips the system image as a temporary file “CHSysimage-XXXXXX.so” (where XXXXXX is a random string) and uses it to run the Julia script “myscript.jl”.
- “myscript.jl” uses the global package environment to load some input files, do some computations, and then save an output file. (Activating a local environment for my project causes some jobs to fail, for reasons I don’t fully understand.)
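
For concreteness, here is a minimal sketch of what “batch.sh” does. It is not my exact script. The file names come from the steps above; the function names `unpack_sysimage` and `run_job`, the fallback to HTCondor's `_CONDOR_SCRATCH_DIR` variable, and the GNU `mktemp` template with a `.so` suffix are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch of batch.sh (hypothetical helper names; file names from the post).

# Decompress a gzipped system image into a uniquely named temporary file
# and print that file's path. Assumes GNU mktemp, which accepts a suffix
# after the XXXXXX placeholder.
unpack_sysimage() {
    local archive="$1"
    # Assumption: write into the job's scratch directory if HTCondor set one,
    # otherwise into the current directory.
    local dir="${_CONDOR_SCRATCH_DIR:-.}"
    local tmp
    tmp="$(mktemp "$dir/CHSysimage-XXXXXX.so")" || return 1
    if ! gunzip -c "$archive" > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    printf '%s\n' "$tmp"
}

# Run myscript.jl with the unpacked system image, then clean up.
run_job() {
    local arg="$1"
    local sysimage status
    sysimage="$(unpack_sysimage CHSysimage.so.gz)" || return 1
    julia --sysimage "$sysimage" myscript.jl "$arg"
    status=$?
    rm -f "$sysimage"
    return "$status"
}

# Only start a job when an argument is supplied (e.g. by the submit file).
if [ "$#" -gt 0 ]; then
    run_job "$1"
fi
```

The submit file would pass each job's argument to the script, so a single job corresponds to something like `./batch.sh 17`, where `17` is a made-up argument value.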
Some of the idiosyncrasies I’ve run into are:
- If I don’t precompile and transfer the packages in a system image, each job accesses a shared file system whenever it compiles a function for the first time. This places a huge burden on that system and has caused it to crash (and me to get in trouble with the admins!).
- My understanding of the best practices outlined by the Modern Julia Workflows blog is that I should put my mission-critical code into a package and activate the directory containing that package as my local package environment. However, activating an environment within my Julia code also seems to place an excessive burden on shared file systems. (To be honest, I don’t understand the reasons well enough to explain them; otherwise I might not be here asking for help.)
I don’t have a computer science background, and I’m far from an expert on file systems or on which under-the-hood aspects of Julia can become problematic. I’ve received some generous help from my university’s research computing center (and ChatGPT); however, I haven’t had the chance to talk to a human with Julia expertise.
Any advice (general or specific) for using Julia on HTCondor and shared file systems would be greatly appreciated!