I have a bunch of scripts in a folder JuliaSEM/src/ that I am trying to run on a cluster.
My directory structure looks like this.
JuliaSEM/ folder contents:
  run.jl
  src/
  data/
  (and some other folders)
src/ folder contents:
  main.jl
  parameters/defaultParameter.jl
  (and some other functions)
My actual path to the main.jl file is /home/prith/JuliaSEM/src/main.jl
I have one file, run.jl, in which I include the main file with include("src/main.jl"). This gives me an error on worker 2: home/prith/src/main.jl : No such file exists.
If I try include("JuliaSEM/src/main.jl"), it gives me an error on the master process: home/prith/JuliaSEM/JuliaSEM/src/main.jl : No such file exists.
If I try to include the absolute path, it still gives me an error on the master process.
I have not seen this error when running it on my local system with 4 worker processes (julia -p 4 run.jl). I think my directory structure is somehow not carried over to the other workers, and they just search the home directory.
I am running the program on the cluster just by using julia -p 4 run.jl in my #PBS script file.
For a workaround, I copied all my files into the home directory and it works fine (though there are other errors, which I am working on). Any ideas?
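One thing that may be worth trying before copying files around: the error paths suggest the workers resolve "src/main.jl" against their working directory, which in a PBS job is typically the home directory unless the script changes it. The sketch below assumes the job was submitted with qsub from inside the JuliaSEM folder, so that PBS_O_WORKDIR points there; the helper name job_dir is hypothetical.

```shell
# Hypothetical helper: choose the directory the job should run from.
# PBS_O_WORKDIR is set by PBS/Torque to the directory where qsub was invoked;
# outside a PBS job it is unset, so fall back to the current directory.
job_dir() {
  printf '%s\n' "${PBS_O_WORKDIR:-$PWD}"
}

# In the job script (shown as comments so the sketch has no side effects):
#   cd "$(job_dir)" || exit 1   # e.g. /home/prith/JuliaSEM if qsub was run there
#   julia -p 4 run.jl
```

With the working directory set this way, workers started by -p should inherit it, so a relative include("src/main.jl") would point at the same file on every process.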
A quick follow-up question: have you worked with PBS batch files?
I am doing
#PBS -l nodes=4:ppn=4,walltime=20:00:100
julia -p 16 run.jl
and it is running without errors, but it seems much slower than the same program I was running on my local system with julia -p 4 run.jl. I think I am not adding procs correctly. Any suggestions?
If you’re using the Flux HPC cluster at the University of Michigan, Julia may not be configured properly there to use multiple nodes in a single job via SSH (another mechanism would be required).
Try this instead:
#PBS -l nodes=1:ppn=16,walltime=20:00:100
julia -p 16 run.jl
Your local HPC support staff should be able to help you with the specifics of Julia on the cluster you are using.
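A likely reason for the slowdown, assuming Julia was started without a machine file: julia -p 16 starts all 16 workers on the node that runs the script, and with nodes=4:ppn=4 only 4 cores are allocated on that node. A minimal sketch for checking this from inside a job follows; the helper name local_slots is hypothetical, and it assumes a Torque-style node file that lists each hostname once per allocated core, in the same form that hostname prints.

```shell
# Hypothetical helper: count how many job slots (cores) a PBS node file
# assigns to one host. Torque-style node files list a hostname once per core.
local_slots() {
  nodefile=$1
  host=$2
  # -c counts matching lines, -x matches whole lines, -F treats host literally.
  # grep exits non-zero when the count is 0, so "|| true" keeps the helper quiet.
  grep -cxF -- "$host" "$nodefile" || true
}

# Inside a job this could be used as (comments only, so nothing runs here):
#   local_slots "$PBS_NODEFILE" "$(hostname)"
```

If the number printed is smaller than the value passed to -p, the workers are sharing cores.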
What you said works fine; that was going to be my backup plan: manually running multiple jobs to get multiple simulation results. The only issue is that it will require a walltime of 200 hrs.
I just noticed you work at U of M. I had actually talked to support to get julia/1.0.0 installed. I assumed that julia would natively support cluster management. I am using lsa-flux, standard account.
What would be the other alternative mechanism? I know that MPI.jl is not yet available for julia 1.0.0.
I have also not been able to install external packages like ClusterManagers or JLD2 on my login node. I am wondering if this is normal for umich flux users.
This probably isn’t a good forum in which to discuss institution-specific issues since they won’t apply to most people who are watching this topic. Contacting your local HPC support staff directly would be best.
MPI is the best solution for using multiple nodes on clusters that do not allow SSH between nodes in compute jobs.
A not-good solution which we have used before is to modify the code that runs ssh name-of-node in order to start processes on remote nodes to instead run either pbsdsh -o -h name-of-node or mpirun --map-by ppr:1:node -H name-of-node where name-of-node should be replaced with the name of the remote node on which the job is trying to start the process. Note that although the second command uses mpirun, it is just used to spawn processes on the remote node and isn’t actually using MPI (message passing) for interprocess communications.
The solution above is not good because neither pbsdsh nor mpirun is a drop-in replacement for SSH, and you may have greater or lesser degrees of success depending on what the software you are using expects. I have not looked into Julia and its packages to see whether either of these would work, and they may not work.
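For illustration only, the two replacement commands could be assembled by a small helper like the one below. The function name remote_spawn_cmd is hypothetical, it only prints the command rather than running it, and nothing here patches Julia or has been tested against it.

```shell
# Hypothetical helper: print the command that would stand in for
# `ssh name-of-node command [args...]` when SSH between nodes is not allowed.
# Usage: remote_spawn_cmd pbsdsh|mpirun name-of-node command [args...]
remote_spawn_cmd() {
  launcher=$1
  node=$2
  shift 2
  case "$launcher" in
    # Flags are the ones quoted in the post above.
    pbsdsh) printf '%s\n' "pbsdsh -o -h $node $*" ;;
    mpirun) printf '%s\n' "mpirun --map-by ppr:1:node -H $node $*" ;;
    *) echo "unknown launcher: $launcher" >&2; return 1 ;;
  esac
}
```

For example, remote_spawn_cmd pbsdsh name-of-node hostname prints the pbsdsh form of the command, which can then be tried by hand inside a job before touching any launch code.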
My advice would be to wait until MPI.jl is working with Julia 1.0.0. Until then, you can use a single node, request all of the cores on the node, and run julia -p auto to start one local Julia worker process for each core your job has access to on the node.
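A single-node job script along those lines might look like the output of the sketch below. The function name make_job_script is hypothetical; the core count and walltime are parameters because they depend on the node type and on your job, and the cd line assumes the job is submitted from the folder that contains run.jl.

```shell
# Hypothetical helper: print a single-node PBS job script to stdout.
# Usage: make_job_script <cores-per-node> <walltime hh:mm:ss>
make_job_script() {
  ppn=$1
  walltime=$2
  cat <<EOF
#!/bin/bash
#PBS -l nodes=1:ppn=${ppn},walltime=${walltime}
# Run from the directory where qsub was invoked (assumed to hold run.jl).
cd "\$PBS_O_WORKDIR" || exit 1
# One local worker per core on the node; request the whole node for this.
julia -p auto run.jl
EOF
}
```

Running make_job_script 16 20:00:00 > job.pbs would write a script that can be submitted with qsub job.pbs, assuming the node has 16 cores.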