I am experimenting with using Julia as the interpreter for my slurm run scripts, but I am running into an issue and hoping someone has seen this before. For reference, my cluster uses Lmod.
As part of my run scripts I like to make sure any loaded modules are purged. If I use bash as the interpreter then the run script looks like this:
#!/bin/bash
#SBATCH directives...
module purge
module load XXXX
julia main.jl
Now with Julia, though, when I try to do this using run I get the following error (see the example run script farther down):
ERROR: IOError: could not spawn `module list`: no such file or directory (ENOENT)
If I use Julia’s shell mode I do not get this error. So I assume this has something to do with paths and accessing the module command. If it matters, I read that the module command is a shell function (not sure what that means).
#!/usr/bin/env julia
#SBATCH directives...
run(`module purge`) # throws error shown above
run(`module load XXXX`)
proj="path/to/my/proj"
using Pkg
Pkg.activate(proj)
main="main.jl"
include(main)
Did you compare the environment variables when you start the script vs. when you run the offending line from within Julia? (you could compare the output of the shell command env in both cases)
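For example, something like this would show the environment as the Julia process sees it, which you could then compare with env | sort in your shell (just a quick diagnostic snippet):
# Inside Julia (REPL or script): print the environment the process sees, sorted
foreach(println, sort(["$k=$v" for (k, v) in ENV]))
# Then compare with the output of `env | sort` in the shell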
I sometimes ran into issues because on my cluster, the shell environment available at the start of a cluster job was not the same (e.g. different HOME folder) and there are some options there to tweak.
Could it be that this is related to whether/how your .bashrc or profile files are read when starting the script vs. when you run the code from a login shell?
Is the second script you posted also run via Slurm or directly via the command line in a local shell? And is /usr/bin/env julia the same instance of Julia that would be started if you just use julia for your local tests (where the purge command worked)?
Yes - they appear to be the same (I did a text comparison).
The second script is being run via sbatch my_slurm_script.sh, submitted from the login node. Note that the error occurs whether I run Julia on the login node or inside a batch job. For example, if I start the Julia REPL on the login node and try run(`module list`), I get the error, but if I switch to shell mode (; in the REPL) it executes fine.
I checked and starting Julia via /usr/bin/env julia gives the same version as the one I get when I type julia in the terminal on the login node.
Thanks for the follow-up information! Despite my unfocused blanket questions I think we found the issue.
I just checked on our cluster (we also use Slurm + Lmod) and get the same behavior. The issue seems to be what you already mentioned in the original post:
I found similar descriptions here and on StackOverflow explaining the problem and the corresponding part of the docs:
The command is never run with a shell. Instead, Julia parses the command syntax directly, appropriately interpolating variables and splitting on words as the shell would, respecting shell quoting syntax. The command is run as julia’s immediate child process, using fork and exec calls.
Short summary:
The shell, e.g. bash or zsh, is a program (an executable file/script somewhere on the system) that usually runs when you open a terminal and is used to run all kinds of other programs. When you run a command in the shell, it will either look for an executable file in the system's path variable(s) and run that, or, if the command is "built into" the shell (or defined as a function in some shell script), run that instead. module seems to be the latter.
Since Julia doesn't run commands in a shell when we use run(cmd), only actual executables on the path will be found. This also explains why run(`/usr/bin/bash -c module list`) worked: we first run the shell, which then has the module shell function available again. Similarly, shell mode (started with ; in the REPL) uses a shell as well, so that also works.
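For illustration, this is roughly what the difference looks like from the REPL (a sketch; whether module is actually defined inside the spawned bash depends on how your cluster sets it up):
run(`which bash`)              # works: which is an ordinary executable on the PATH
run(`module list`)             # fails with ENOENT: module is a shell function, not an executable
run(`bash -c "module list"`)   # may work: a shell runs the command, so shell functions can be found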
Here's a way to tell what is built-in and what is not (the which command looks for the executable file that corresponds to the command). EDIT: That's not quite correct, please check the next reply by @sijo for an important clarification!
which ls will print something like /usr/bin/ls
which which → /usr/bin/which
but which source will give which: no source in (..long list of paths to look into..); the source command is built into the shell and loads shell code from a file
and similarly which module also fails, which indicates that module is indeed not an executable file somewhere on the system, but only works within a shell
This is an issue with command parsing. As far as I understand, since parentheses are used for the $( ) interpolation syntax in command literals, we have to quote the parentheses if they should appear literally in the final command (see the quick example below).
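For instance (a quick sketch from the REPL; the echoed strings are arbitrary):
x = "hello"
run(`echo $(uppercase(x))`)    # $( ) interpolates the Julia expression, prints HELLO
run(`echo "(hello)"`)          # quoting keeps the parentheses as part of the literal argument, prints (hello)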
But module("list") from the link you mentioned looks like a call to a Python function, and it most likely wouldn't work in a shell. The usual syntax for invoking a command with arguments uses spaces instead of parentheses, as in echo hello (not echo(hello)).
To elaborate on built-in shell commands: which xxx will show you where the command xxx can be found in the PATH but it’s not necessarily what gets used by your shell. To know how the shell actually interprets a command, you can use type. Here are a few examples to show the difference, using Bash 5.1.16:
Some commands are pure built-in:
$ which cd
$ type cd
cd is a shell builtin
And some commands correspond to an executable in the PATH, but it’s not used because the shell will use a built-in instead:
$ which printf
/usr/bin/printf
$ type printf
printf is a shell builtin
This can also happen when you define an alias:
$ which ls
/usr/bin/ls
$ type ls
ls is aliased to `lsd'
$ which lsd
/usr/bin/lsd
$ type lsd
lsd is /usr/bin/lsd
ls is in the PATH, but my Bash has an alias to replace it with lsd.
The module shell function is used to set up the PATH and other environment variables. This cannot be done in an executable because executables run in a sub-process, and environment variables are not inherited “upwards”. Thus, the module command must be a shell function, i.e. a shorthand for a sequence of shell commands. It’s typically enabled either in one of your shell startup scripts (.bashrc or similar), or in a system wide shell startup script.
How exactly it works I'm not sure, but typically it runs an executable which prints the setup to stdout, and that output is then eval'ed by the shell. Think of things like eval $(ssh-agent), where ssh-agent produces output like SSH_AUTH_SOCK=/tmp/ssh-23224; SSH_AGENT_PID=23223, which the shell then evals. The resulting environment variables are inherited by subsequently started executables.
Thus, even if you succeed in running module from inside julia via run(`bash -c …`) or similar, it will not, and cannot, set environment variables in your julia process. It merely sets things in the bash shell you have run, which exited before run(...) returned.
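You can see this directly from the REPL (a small demonstration; MY_TEST_VAR is just a made-up name):
run(`bash -c "export MY_TEST_VAR=hello && env | grep MY_TEST_VAR"`)   # the variable exists inside that bash process...
haskey(ENV, "MY_TEST_VAR")                                            # ...but this is false: julia's environment is untouched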
What could be possible is to run a shell as a sub-process of julia, with a virtual terminal attached, and forward shell commands to that. It sounds easier to use a shell script for slurm jobs.
It could also be possible to create a julia package for Lmod; such packages exist for python, R and some other languages. The module command would then print the right syntax to stdout, in julia's case things like ENV["PATH"] = ...; ENV["WHATEVER"] = "what ever". It would be read by julia, e.g. like setup = module("load intelcompiler"), and eval'ed by julia like eval(Meta.parse(setup)).
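In the meantime, a rough way to get part of the way there without a full Lmod wrapper could be to let a login shell do the module load and then copy the resulting environment into the julia process. A sketch, assuming bash, a module function that is available in a login shell, and GNU env with the -0 option (module_load! and the module name XXXX are placeholders):
function module_load!(name::AbstractString)
    # Run `module load` in a login shell and dump the resulting environment,
    # NUL-separated so that values containing newlines don't break the parsing.
    out = read(`bash -lc "module load $name >/dev/null 2>&1 && env -0"`, String)
    for entry in split(out, '\0'; keepempty=false)
        occursin('=', entry) || continue
        key, value = split(entry, '='; limit=2)
        ENV[key] = value    # copy into the julia process's environment
    end
end

module_load!("XXXX")        # e.g. before using packages that need the module's libraries
This only propagates environment variables, of course; anything else the modulefile does (aliases, shell functions) is lost.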
Thanks everyone, these responses have been great and very helpful.
After reading everything, though, I agree it seems easier to stick with bash at this point. It's a bit of a shame, because I have written a few elaborate bash scripts for preparing simulation inputs and parsing slurm/environment variables, and compared to Julia, bash is clunky enough that doing so was cumbersome. I got excited after reading a bit about shell scripting in Julia.
I will consider writing a package for Julia following the example of the python lmod package (since I don’t know much about lmod in the first place). Or perhaps I need to rethink how I’ve structured my slurm run scripts (maybe I can move more code into the Julia main.jl that gets called).