Getting started with HPC and Julia

kerim · September 25, 2023, 9:43am

Hi,

I’ve built a small cluster (a master node and 4 other nodes) in the cloud, and now I need to check whether Julia’s Distributed is able to use it.

Setup:
Julia dirpath: ~/julia - all the nodes have it
Single shared folder: ~/shared - created on master node and mounted on all other nodes

When I launch Julia and check the number of workers with nworkers(), it shows only 1.

I guess something doesn’t work as I expect.
What might I be doing wrong?

giordano · September 25, 2023, 9:58am

How are you launching julia?

kerim · September 25, 2023, 9:59am

Hi,

The julia executable is on the PATH, so I just type julia on the master node.

giordano · September 25, 2023, 10:03am

Julia by default starts with a single process:

% julia -E 'using Distributed; nprocs(), nworkers()'
(1, 1)

so there’s nothing surprising there. You may want to read the documentation of the Distributed standard library.
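
For reference, a minimal sketch of how those counts change once workers are added explicitly. It calls addprocs(2) with only a count, which starts the extra processes on the local machine; the variable names are just for illustration.

```julia
using Distributed

# A fresh session has only the master process (id 1).
procs_before = nprocs()

# Start two extra worker processes on the local machine.
new_ids = addprocs(2)

procs_after = nprocs()
println("processes before: ", procs_before, ", after: ", procs_after)
println("worker ids: ", workers())
```

Workers started this way all live on the machine where Julia was launched; reaching other nodes needs a machine list or a cluster manager passed to addprocs.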

kerim · September 25, 2023, 10:08am

I know I can run julia -p 5, but even if Julia can’t see the other nodes it will still launch with 5 workers, all sharing the CPU of a single machine.

For now the main task is to understand whether Julia can actually see other nodes or not.
I could probably check it by the IP of each worker, but I can’t find information on that.
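
One possible way to check this, as a sketch: ask every process for the name of the machine it runs on with remotecall_fetch(gethostname, p). The two local workers below are only there for illustration, so every name will match the master’s; with workers on other nodes the names would differ.

```julia
using Distributed

# Local workers for illustration; on a real cluster they would come from
# a cluster manager or an SSH machine list instead.
addprocs(2)

# Ask each process (master included) for the hostname of its machine.
hosts = Dict(p => remotecall_fetch(gethostname, p) for p in procs())

for p in sort(collect(keys(hosts)))
    println("process ", p, " runs on ", hosts[p])
end
```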

juliohm · September 25, 2023, 10:24am

Worth sharing:

kerim · September 25, 2023, 10:40am

Thank you for sharing this!

I guess for now the main problem is that I can’t set up the cluster: Distributed can’t see the other nodes.

juliohm · September 25, 2023, 10:48am

Are you using ClusterManagers.jl as explained in the resource above? Or creating the submission scripts? Which resource manager is installed on your cluster? Most Linux clusters I have worked with use SLURM.

kerim · September 25, 2023, 10:50am

Oh, I think I don’t have any cluster manager yet.
I’m reading about SLURM now. Probably after setting it up we will see some progress.

kerim · September 28, 2023, 9:41am

Hi,

It seems I successfully created a small cluster and installed SLURM.
sinfo output:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      3   idle cl1cfqj5m64qsem9ra43-ecaj.ru-central1.internal,cl1cfqj5m64qsem9ra43-ukeq.ru-central1.internal,master-fwi.ru-central1.internal

srun -N3 hostname produces:

cl1cfqj5m64qsem9ra43-ecaj.ru-central1.internal
master-fwi.ru-central1.internal
cl1cfqj5m64qsem9ra43-ukeq.ru-central1.internal

I hope it is working correctly; it took me 3 days to get this far.

Now, back to Julia.
First of all I must be sure that Julia can see all nodes (do I have to launch Julia on all nodes, or is it enough that the julia bin directory is in the PATH env var?).
I installed ClusterManagers.jl on all nodes and, following its example, I ran:

using Distributed, ClusterManagers
addprocs(SlurmManager(2), partition="debug", t="00:5:00")

and then Julia hangs on:

connecting to worker 1 out of 2
srun: error: cl1cfqj5m64qsem9ra43-ecaj.ru-central1.internal: tasks 0-1: Exited with exit code 1

juliohm · September 28, 2023, 9:51am

Can you try the tutorial I shared without ClusterManagers.jl first? Just create a SLURM submission script and see if it works as expected. If it doesn’t then other people can help setting up slurm and Julia correctly.

kerim · September 28, 2023, 10:05am

I cloned the repository you provided and cd’d into it.
Then in Julia, after Pkg.instantiate(); Pkg.precompile(), I get an error:

Pkg.instantiate(); Pkg.precompile()
┌ Warning: The active manifest file has dependencies that were resolved with a different julia version (1.9.0). Unexpected behavior may occur.
└ @ ~/shared/dev/julia-distributed-computing/Manifest.toml:0
ERROR: MethodError: no method matching joinpath(::Nothing)
Closest candidates are:
  joinpath(::AbstractString) at path.jl:251
  joinpath(::AbstractString, ::AbstractString...) at path.jl:296
Stacktrace:
  [1] stat(path::Nothing)
    @ Base.Filesystem ./stat.jl:109
  [2] ispath(path::Nothing)
    @ Base.Filesystem ./stat.jl:311
  [3] download_source(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}, urls::Dict{Base.UUID, Vector{String}}; readonly::Bool)
    @ Pkg.Operations /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/Operations.jl:752
  [4] #download_source#57
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/Operations.jl:742 [inlined]
  [5] download_source
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/Operations.jl:740 [inlined]
  [6] instantiate(ctx::Pkg.Types.Context; manifest::Nothing, update_registry::Bool, verbose::Bool, platform::Base.BinaryPlatforms.Platform, allow_build::Bool, allow_autoprecomp::Bool, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Pkg.API /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1410
  [7] instantiate
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1325 [inlined]
  [8] #instantiate#252
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1321 [inlined]
  [9] instantiate()
    @ Pkg.API /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1321
 [10] top-level scope
    @ REPL[2]:1

Julia 1.6.7

kerim · September 28, 2023, 10:07am

I can see that in Manifest.toml the required Julia version is 1.9.0.
Is this probably the issue?

juliohm · September 28, 2023, 10:12am

The error message is clear about the Julia version. Also the tutorial I shared doesn’t have a script to run on slurm.

juliohm · September 28, 2023, 10:13am

Can you read the README and follow the steps instead? You don’t need to clone the repository.

kerim · September 28, 2023, 11:05am

It seems I’ve found a way to run the code:

To run on multiple nodes I guess I either have to prepare a SLURM script or slightly modify the Julia script to add workers first, right?
As I don’t know yet how to write a SLURM script, I prefer to call addprocs(2) in the Julia script.
But my nodes don’t have the CSV and ProgressMeter packages. How can I install these packages on all nodes?

juliohm · September 28, 2023, 11:13am

Can you please continue reading the README?

kerim · September 28, 2023, 11:25am

I think we misunderstand each other…

I use Julia 1.6.7, so it seems I can’t install this repo. When I use:

add https://github.com/Arpeggeo/julia-distributed-computing.git
     Cloning git-repo `https://github.com/Arpeggeo/julia-distributed-computing.git`
    Updating git-repo `https://github.com/Arpeggeo/julia-distributed-computing.git`
ERROR: expected a `name` entry in project file at `/tmp/jl_2cQn6c/Project.toml

So I don’t know why but I can’t install this package.

Then you say I don’t have to clone the package… Then how do I use it? The provided script uses the data/ and result/ directories.

Reading the README doesn’t help much, sorry…

juliohm · September 28, 2023, 11:34am

The README teaches you how to do distributed computing. You don’t need to clone the repo, reproduce the example, or anything like that.

If you had read the README from beginning to end, you would have found the last section, which explains how to create submission scripts. It is super easy to search online for a SLURM submission script and reproduce the example.

I will let other people help you from here.

johnh · September 28, 2023, 2:51pm

Stupid question - why are you using Julia 1.6.7?
Also another point - above you say that packages need to be installed on the compute nodes.
You should have a shared directory which is mounted on all nodes. If a package is installed there, it should be available everywhere. Maybe I am misunderstanding this.
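
A quick way to test that assumption, as a rough sketch: ask each process for its first depot directory, i.e. the first entry of DEPOT_PATH, where Pkg installs packages by default. If that directory sits on the shared mount for every process, a package installed once should be visible everywhere. Local workers stand in for the cluster nodes here, and the variable names are illustrative.

```julia
using Distributed

# Local stand-ins for cluster workers (assumption: real workers would be
# added through a cluster manager or an SSH machine list).
addprocs(2)

# First depot directory (default package install location) on each process.
depots = Dict(p => remotecall_fetch(() -> first(DEPOT_PATH), p) for p in procs())

for p in sort(collect(keys(depots)))
    println("process ", p, " uses depot ", depots[p])
end
```

If the depot is shared, loading a package on all workers with @everywhere using CSV should then work without installing it per node.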
