I have a bunch of scripts in a folder
JuliaSEM/src/ that I am trying to run on a cluster.
My directory structure looks like:
JuliaSEM folder contents:
and some other folders.
src/ folder contents:
some other functions
My actual path to the main.jl file is
I have one file
run.jl in which I include the main function
include("src/main.jl"). This gives me an error on worker 2
home/prith/src/main.jl : No such file exists.
If I try
include("JuliaSEM/src/main.jl"), it gives me an error on the master worker
home/prith/JuliaSEM/JuliaSEM/src/main.jl : No such file exists.
If I try to include the absolute path, it still gives me an error on the master process.
I have not seen this error when I am running it on my local system with 4 nodes (
julia -p 4 run.jl). I think somehow my directory structure is not translated to multiple nodes, and on other workers it just searches the home directory.
I am running the program on the cluster just by using
julia -p 4 run.jl in my #PBS script file.
For a workaround, I copied all my files into the home directory and it works fine (though there are other errors, which I am working on). Any ideas?
You can use
@__DIR__ in front. See my post here: Distributed parallelism within packages/applications for an example.
In run.jl you could try
include(joinpath(@__DIR__, "src", "main.jl")).
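To illustrate why this helps, here is a minimal sketch. It assumes a temporary directory standing in for the JuliaSEM folder and a made-up function `placeholder_main` (both hypothetical, not from the original project). It only shows that a path built with `joinpath` from a known base directory stays valid after the working directory changes; in run.jl the base would be `@__DIR__`.

```julia
# Hypothetical stand-in for the JuliaSEM/ folder: a temporary directory
# containing src/main.jl with one placeholder function.
project_dir = mktempdir()
mkpath(joinpath(project_dir, "src"))
write(joinpath(project_dir, "src", "main.jl"), "placeholder_main() = :loaded\n")

# In run.jl the base would be @__DIR__ (the directory containing run.jl).
# Here project_dir plays that role so the sketch runs from anywhere.
main_path = joinpath(project_dir, "src", "main.jl")

# Move to an unrelated working directory, as a batch job might start in.
cd(tempdir())

# The absolute path does not depend on the working directory.
println("absolute path: ", isabspath(main_path))
println("file found:    ", isfile(main_path))

# Loading the file still works from the new working directory.
include(main_path)
println("result:        ", placeholder_main())
```

The same absolute path can be passed to any process that shares the filesystem, which is presumably why the `@__DIR__` form avoids the "No such file" error here.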
Thanks a lot! This works.
A quick follow up question: have you worked with PBS batch files?
I am doing
#PBS -l nodes=4:ppn=4,walltime=20:00:100
julia -p 16 run.jl
and it is running without errors, but it seems much slower than the same program I was running on my local system with
julia -p 4 run.jl. I think I am not adding procs correctly. Any suggestions?
If you’re using Flux HPC cluster at the University of Michigan, Julia may not be configured properly there to use multiple nodes in a single job via SSH (another mechanism would be required).
Try this instead:
#PBS -l nodes=1:ppn=16,walltime=20:00:100
julia -p 16 run.jl
Your local HPC support staff should be able to help you with the specifics of Julia on the cluster you are using.
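One way to check where the workers actually ended up is a small diagnostic like the sketch below (an illustration, not something specific to Flux). The `-p` flag starts local workers, so with `julia -p 16` all sixteen would be expected to report the same host name, even if the job requested four nodes.

```julia
using Distributed

println("total processes: ", nprocs())
println("worker processes: ", nworkers())

# Ask each worker for the name of the machine it runs on.
hosts = [remotecall_fetch(gethostname, w) for w in workers()]
for (w, h) in zip(workers(), hosts)
    println("worker ", w, " on ", h)
end

# A single distinct host means every worker shares one node.
println("distinct hosts: ", length(unique(hosts)))
```

Saved as, say, check_workers.jl (a hypothetical file name) and run with `julia -p 16 check_workers.jl` inside the job, this prints one line per worker.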
Thanks. Yes, I am using the umich hpc flux.
What you said works fine, that was gonna be my backup plan. Manually running multiple jobs to get multiple simulation results. The only issue is that it will require walltime of 200 hrs.
I will talk to the system admin.
I just noticed you work at U of M. I had actually talked to support to get julia/1.0.0 installed. I assumed that julia would natively support cluster management. I am using lsa-flux, standard account.
What would be the other alternative mechanism? I know that
MPI.jl is not yet available for julia 1.0.0.
I have also not been able to install external packages like
JLD2 on my login node. I am wondering if this is normal for umich flux users.
Thanks a lot!
This probably isn’t a good forum in which to discuss institution-specific issues since they won’t apply to most people who are watching this topic. Contacting your local HPC support staff directly would be best.
MPI is the best solution for using multiple nodes on clusters that do not allow SSH between nodes in compute jobs.
A not-good solution which we have used before is to modify the code that runs
ssh name-of-node in order to start processes on remote nodes to instead run either
pbsdsh -o -h name-of-node or
mpirun --map-by ppr:1:node -H name-of-node where
name-of-node should be replaced with the name of the remote node on which the job is trying to start the process. Note that although the second command uses
mpirun, it is just used to spawn processes on the remote node and isn’t actually using MPI (message passing) for interprocess communications.
The solution above is not good because neither
pbsdsh nor mpirun is a drop-in replacement for SSH, and you may have greater or lesser degrees of success depending on what the software you are using expects. I have not looked into Julia and its packages to see whether either of these would work, and they may not work.
My advice would be to wait until
MPI.jl is working with Julia 1.0.0. Until then, you can use a single node, request all of the cores on the node, and run
julia -p auto to start one local Julia worker process for each core your job has access to on the node.
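As a rough sanity check for the single-node setup, a sketch like the following compares the number of workers with the CPU threads Julia detects on the machine. Note the assumption: `Sys.CPU_THREADS` reports what Julia sees on the node, which may differ from the cores the scheduler actually granted to the job.

```julia
using Distributed

# Sys.CPU_THREADS is Julia's count of logical CPU threads on this machine.
println("logical CPU threads: ", Sys.CPU_THREADS)
println("worker processes:    ", nworkers())

# More workers than CPU threads means the workers compete for cores.
if nworkers() > Sys.CPU_THREADS
    println("more workers than CPU threads: expect contention")
end
```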
Yeah sorry, I was just excited to see someone from U of M on this forum.
Thanks for the advice, I’ll run on a single node for now.