How to run Julia on a cluster?

Hi,

I am trying to run Julia on the cluster. The main problem for me is writing the job script that tells the cluster to run my Julia commands. The cluster uses Slurm.
Does anyone have an idea how to do it?

It would help to be more specific about where you are facing problems.

In my case, I have a Julia function in my project that writes the sbatch file for each case that I want to run. Something like:

#!/bin/bash

#SBATCH -N 1                 # use 1 node
#SBATCH -n 10                # 10 tasks (cores)
#SBATCH -t 0-16              # time limit of 16 hours
#SBATCH -p general           # partition to run on
#SBATCH --mem-per-cpu 8000   # memory per core, in MB
#SBATCH -o '/nas/longleaf/home/lhendri/Documents/projects/p2019/college_stratification/log/hTypes2Bounded.out'
#SBATCH --mail-type=end
#SBATCH --mail-user=me@myemail.edu

export JULIA_NUM_THREADS=10
julia --project="." --startup-file=no "longleaf/run_hTypes2Bounded.jl"
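
For reference, a minimal sketch of what such a file-writing function could look like (the function name make_sbatch, the case_name argument, and the log/script paths are all hypothetical):

# Hypothetical sketch: write an sbatch file for one case.
function make_sbatch(case_name; n_cores = 10, hours = 16)
    open("slurm_$(case_name).sl", "w") do io
        println(io, "#!/bin/bash")
        println(io, "#SBATCH -N 1")
        println(io, "#SBATCH -n $n_cores")
        println(io, "#SBATCH -t 0-$hours")
        println(io, "#SBATCH -p general")
        println(io, "#SBATCH --mem-per-cpu 8000")
        println(io, "#SBATCH -o 'log/$(case_name).out'")
        println(io, "export JULIA_NUM_THREADS=$n_cores")
        println(io, "julia --project=\".\" --startup-file=no \"longleaf/run_$(case_name).jl\"")
    end
end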

Another function writes the script to be run, which looks like

using Pkg
Pkg.instantiate()
using CollegeStrat 
using ModelParams 

find_global_opt([:hTypes2, :consAggrCes, :k0Beta], 100000, 1000, 100; maxHours = 8, loadStartingGuess = true, localSolver = :nelder) 

println("Done.") 
println("--------------") 

All of this gets copied to the cluster with rsync. Then I just issue

sbatch slurmfile.sl
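
For the rsync step, something along these lines works, here wrapped in Julia's run (the remote host and path are placeholders):

# Hypothetical sketch: copy the local project directory to the cluster.
remote = "user@login.mycluster.edu:~/projects/college_stratification/"
run(`rsync -az --exclude=.git . $remote`)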

I hope I am not completely missing your question. More detail would really help.


Rather than using sbatch or srun commands directly, I would use the ClusterManagers.jl package. That said, it is probably very important that you understand how srun and sbatch work, since that’s what ClusterManagers.jl uses under the hood; it will also help you manage your cluster jobs better. Next, I would recommend reading Multi-processing and Distributed Computing · The Julia Language in the manual to understand the functions and methods for running parallel code.

After that, it’s a fairly easy process. First, add your worker processes:

using Distributed, ClusterManagers
# keyword arguments are forwarded to Slurm, e.g. N sets the node count
addprocs(SlurmManager(nProcs), N = nNodes)

Then define your computationally expensive function on all worker processes:

@everywhere function work(sim_id)
    # do the heavy, expensive calculation here
end

Then use the high-level, user-friendly pmap function to run your function over the workers, i.e.

pmap(work, 1:nsims)
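
Putting those pieces together, a minimal end-to-end sketch (nProcs, nNodes, and nsims are placeholder values, and the body of work is a stand-in for a real computation):

using Distributed, ClusterManagers

nProcs, nNodes, nsims = 20, 2, 100
addprocs(SlurmManager(nProcs), N = nNodes)  # request workers through Slurm

@everywhere function work(sim_id)
    sum(randn(10^6))  # stand-in for the expensive calculation
end

results = pmap(work, 1:nsims)
rmprocs(workers())  # release the allocation when done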

See my reply here for more information: Parallel programming capabilities in Julia - Usage / First steps - JuliaLang


I have my Julia commands from a Jupyter notebook saved as planb.jl on my laptop, and Julia is installed on the cluster. Now I want to run planb.jl on the cluster.

This is the job script example from the documentation, but I do not know how to add the planb.jl file to the job script.

#!/bin/bash
#SBATCH --ntasks=1 # 1 core(CPU)
#SBATCH --nodes=1 # Use 1 node
#SBATCH --job-name=my_job_name # sensible name for the job
#SBATCH --mem=1G # Default memory per CPU is 3GB.
#SBATCH --partition=verysmallmem # Use the verysmallmem-partition for jobs requiring < 10 GB RAM.
#SBATCH --mail-user=myemail@nmbu.no # Email me when job is done.
#SBATCH --mail-type=ALL
module load Julia # Load the Julia module

This is what the planb.jl file looks like (the Julia commands from the Jupyter notebook); it is just a small part of the file.

using Pkg
Pkg.add(["BenchmarkTools", "DelimitedFiles", "CUDA", "Glob"])  # install any missing packages
using SnpArrays, BenchmarkTools, DelimitedFiles, Glob
using CUDA
using StatsBase
datapath = normpath(SnpArrays.datadir())
readdir(glob"cowdata_not_in_Q.*", datapath)
const Cows = SnpArray(SnpArrays.datadir("cowdata_not_in_Q.bed"))
size(Cows)
Bgenemat = convert(Matrix{Float16}, Cows)
# sample 4580 columns without replacement, keeping their original order
casnpmat = @view Bgenemat[:, sample(axes(Bgenemat, 2), 4580, replace=false, ordered=true)]

This is what the last two lines in my example accomplish:

I upload the code to the project directory with rsync. That way, I have a Project.toml and Manifest.toml in place that match my (tested) local version of the code.

All the .jl file then needs to do is

using Pkg
Pkg.instantiate() # in case some package is missing on the remote
using MyPackage
command_I_want_to_run()

@hendri5,

I submitted my job script like this. Somehow I could not use 10 nodes, so I changed it to 1.
After submitting the job script, I did not get an error message (like Slurm…out), so it seems to be working. Thank you very much.

#!/bin/bash
#SBATCH --ntasks=1 # 1 core(CPU)
#SBATCH --nodes=1 # Use 1 node
#SBATCH --job-name=planB # sensible name for the job
#SBATCH --mem=15G # Default memory per CPU is 3GB.
#SBATCH --partition=smallmem # Use the smallmem partition.
#SBATCH --mail-user=wubu@nmbu.no # Email me when job is done.
#SBATCH --mail-type=ALL
#SBATCH -o 'C:/Users/dell/Desktop/julia/planb.out'
export JULIA_NUM_THREADS=1
julia --project="." --startup-file=no "User/run_planb.jl"

@hendri54
It seems like I did not fully understand the procedure, and I am quite confused 🙁
Here are some questions I want to ask:
1. How can I get Project.toml and Manifest.ml in Julia?

2. Should I upload my Julia code and dataset in Project.toml and Manifest.ml?
What is the difference between a .toml and a .ml file? Is this process within Julia or on the cluster terminal?

3. Where is this script used?

using Pkg
Pkg.instantiate() # in case some package is missing on the remote
using MyPackage
command_I_want_to_run()

It would be extremely helpful if you could explain the process step by step.
Many thanks in advance 🙏

Like I mentioned in my reply, you should use ClusterManagers from within Julia to manage your Slurm job. Second, you need to know the difference between running parallel Julia processes using Slurm and shared-memory threading. You don’t need JULIA_NUM_THREADS at this stage if all you are looking for is to run your script independently on multiple processors (a paradigm known as embarrassingly parallel); the sketch below contrasts the two.
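
A minimal sketch of the contrast (the loop bodies are placeholders, not code from this thread):

using Distributed
using Base.Threads

# Shared-memory threading: one Julia process, JULIA_NUM_THREADS threads.
results_threads = zeros(100)
@threads for i in 1:100
    results_threads[i] = i^2  # placeholder work
end

# Embarrassingly parallel across processes: independent workers
# (added earlier via addprocs/SlurmManager) each handle separate inputs.
results_distributed = pmap(i -> i^2, 1:100)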

The way I do it (there may be better alternatives, but it works well for me):

  1. Write the sbatch file (using a script)
  2. Write the “.jl” file to be run on the remote (also using a script). This is the one that starts with using Pkg.
  3. Upload the code with rsync, including Project.toml and Manifest.toml (the entire package directory). (There is no “.ml” file; it’s Manifest.toml. Both are created by Pkg, as sketched after this list.)
  4. At the login node terminal prompt: submit the sbatch file.
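
On question 1: both .toml files are created by Pkg itself when you activate a project and add packages. Roughly (the package list here just mirrors the planb.jl example):

using Pkg
Pkg.activate(".")  # creates Project.toml in the current directory if absent
Pkg.add(["SnpArrays", "BenchmarkTools", "DelimitedFiles", "Glob", "CUDA", "StatsBase"])
# Pkg.add records the direct dependencies in Project.toml and the exact
# resolved versions of everything in Manifest.toml.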

Probably worth pointing out: I am running multithreaded (not distributed) code.

I upload my dataset and .jl file to the directory where Project.toml and Manifest.toml are located.
Is that correct?
Then I submit the sbatch file from the terminal.
Although my job is running, why did I get an empty (Slurm JOBID.out) file?

My approach is to upload the entire repo (including the “.jl” and “.toml” files).

As for data files, there are various options.
Small files, I just keep in the repo.
For larger files, you could use Artifacts or DataDeps.jl. But that is a separate conversation.
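
To give a taste of the DataDeps.jl route (the name, message, and URL below are placeholders, not a real dataset):

using DataDeps

# Register a data dependency; it is downloaded on first use.
register(DataDep(
    "CowData",                                    # hypothetical name
    "Genotype files for the planb analysis.",     # message shown before download
    "https://example.org/cowdata_not_in_Q.bed",   # placeholder URL
))

bedfile = joinpath(datadep"CowData", "cowdata_not_in_Q.bed")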

About the out file, I don’t know the answer, but I have another thread about that open elsewhere.