Getting started with HPC and Julia

Hi,

I’ve built a small cluster (a master node and 4 other nodes) in the cloud, and now I need to check whether Julia’s Distributed is able to use it.

Setup:
Julia dirpath: ~/julia - all the nodes have it
Single shared folder: ~/shared - created on master node and mounted on all other nodes

When I launch Julia and check the number of workers with nworkers(), it shows only 1.

I guess something doesn’t work as I expect.
What could my mistakes be?

How are you launching julia?

Hi,

The julia executable is in the PATH env var, so I just type julia on the master node.

Julia by default starts with a single process:

% julia -E 'using Distributed; nprocs(), nworkers()'
(1, 1)

so there’s nothing surprising there. You may want to read the documentation of the Distributed standard library.
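
As one illustration of what that documentation covers, here is a minimal sketch of adding workers on other machines over SSH with addprocs. The host names and the worker count are hypothetical, and passwordless SSH from the master node to every host is assumed, so the actual launch is wrapped in a function instead of being run right away.

```julia
using Distributed

# Hypothetical host names; replace them with the nodes of your own cluster.
hosts = ["node1", "node2", "node3", "node4"]
workers_per_host = 2

# Each (host, count) tuple asks for `count` worker processes on `host`.
machines = [(h, workers_per_host) for h in hosts]

# Starting the workers needs passwordless SSH to every host (an assumption),
# so this function is only defined here, not called.
function launch_remote_workers(machines)
    return addprocs(machines)
end
```

Calling launch_remote_workers(machines) on the master node would then return the ids of the new workers, provided julia can be started on each host.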

I know I can run julia -p 5, but even if Julia can’t see the other nodes it will still be launched with 5 workers, all sharing the CPU of a single machine.

For now the main task is to understand whether Julia can actually see the other nodes or not.
I could probably check it by the IP of each worker, but I can’t find information on that.
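
One way to check this, as a minimal sketch: ask every process for the name of the machine it runs on. The two local workers below are only there to make the example runnable on one machine; on a real cluster they would be started on the other nodes instead.

```julia
using Distributed

# Two local workers for demonstration; on a cluster these would be remote.
pids = addprocs(2)

# Ask every process (the master included) which machine it runs on.
hosts = [p => remotecall_fetch(gethostname, p) for p in procs()]
for (p, h) in hosts
    println("process ", p, " runs on ", h)
end

# Remove the demonstration workers again.
rmprocs(pids)
```

If all processes report the host name of the master node, the workers are local ones and the other nodes are not being used.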

Worth sharing:

Thank you for sharing this!

I guess for now the main problem is that I can’t set up the cluster. Distributed can’t see the other nodes :frowning:

Are you using ClusterManagers.jl as explained in the resource above? Or creating submission scripts? What resource manager is installed in your cluster? Most Linux clusters I’ve worked with use SLURM.

Oh, I think I don’t have any cluster manager yet.
I’m reading about SLURM now. Probably after setting it up we will see some progress.

Hi,

It seems I successfully created a small cluster and installed SLURM.
sinfo output:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      3   idle cl1cfqj5m64qsem9ra43-ecaj.ru-central1.internal,cl1cfqj5m64qsem9ra43-ukeq.ru-central1.internal,master-fwi.ru-central1.internal

srun -N3 hostname produces:

cl1cfqj5m64qsem9ra43-ecaj.ru-central1.internal
master-fwi.ru-central1.internal
cl1cfqj5m64qsem9ra43-ukeq.ru-central1.internal

I hope it is working correctly; it took me 3 days to solve this :slight_smile:

But now I return to Julia.
First of all I must be sure that Julia can see all nodes (do I have to launch Julia on all nodes, or is it enough that the julia bin directory is in the PATH env var?).
I installed ClusterManagers.jl on all nodes and, following its example, I run:

using Distributed, ClusterManagers
addprocs(SlurmManager(2), partition="debug", t="00:5:00")

and then Julia hangs on:

connecting to worker 1 out of 2
srun: error: cl1cfqj5m64qsem9ra43-ecaj.ru-central1.internal: tasks 0-1: Exited with exit code 1
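
An exit code of 1 from srun only says that the command started on that node failed. One thing worth ruling out is whether Julia starts on the compute nodes at all. A minimal sketch of a diagnostic script follows; the file name check_node.jl and the function node_report are made up for this example.

```julia
# check_node.jl (hypothetical file name)

# Collect a few facts about the Julia installation on the current machine.
function node_report()
    return (host = gethostname(),                  # machine running this process
            version = VERSION,                     # Julia version on that machine
            julia = joinpath(Sys.BINDIR, "julia"), # expected location of the executable
            depot = get(DEPOT_PATH, 1, ""))        # first package depot, if any
end

println(node_report())
```

Saved in the shared folder, it could be run with something like srun -N3 julia ~/shared/check_node.jl. If that already fails, the problem is in starting Julia on the nodes and not in ClusterManagers.jl.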

Can you try the tutorial I shared without ClusterManagers.jl first? Just create a SLURM submission script and see if it works as expected. If it doesn’t, then other people can help with setting up SLURM and Julia correctly.

I cloned the repository you provided and cd’ed into it.
Then in Julia, after Pkg.instantiate(); Pkg.precompile(), I get an error:

Pkg.instantiate(); Pkg.precompile()
┌ Warning: The active manifest file has dependencies that were resolved with a different julia version (1.9.0). Unexpected behavior may occur.
└ @ ~/shared/dev/julia-distributed-computing/Manifest.toml:0
ERROR: MethodError: no method matching joinpath(::Nothing)
Closest candidates are:
  joinpath(::AbstractString) at path.jl:251
  joinpath(::AbstractString, ::AbstractString...) at path.jl:296
Stacktrace:
  [1] stat(path::Nothing)
    @ Base.Filesystem ./stat.jl:109
  [2] ispath(path::Nothing)
    @ Base.Filesystem ./stat.jl:311
  [3] download_source(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}, urls::Dict{Base.UUID, Vector{String}}; readonly::Bool)
    @ Pkg.Operations /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/Operations.jl:752
  [4] #download_source#57
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/Operations.jl:742 [inlined]
  [5] download_source
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/Operations.jl:740 [inlined]
  [6] instantiate(ctx::Pkg.Types.Context; manifest::Nothing, update_registry::Bool, verbose::Bool, platform::Base.BinaryPlatforms.Platform, allow_build::Bool, allow_autoprecomp::Bool, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Pkg.API /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1410
  [7] instantiate
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1325 [inlined]
  [8] #instantiate#252
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1321 [inlined]
  [9] instantiate()
    @ Pkg.API /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1321
 [10] top-level scope
    @ REPL[2]:1

Julia 1.6.7

I can see that in Manifest.toml the required Julia version is 1.9.0.
Is this probably the issue?

The error message is clear about the Julia version. Also, the tutorial I shared doesn’t have a script to run on SLURM.

Can you read the README and follow the steps instead? You don’t need to clone the repository.

It seems I’ve found a way to run the code:

To run on multiple nodes I guess I either have to prepare a SLURM script or slightly modify the Julia script to add workers first, right?
As I don’t know yet how to write a SLURM script, I prefer to call addprocs(2) in the Julia script.
But my nodes don’t have the CSV and ProgressMeter packages. How can I install these packages on all nodes?

Can you please continue reading the README?

I think we misunderstand each other…

I use Julia 1.6.7, so it seems I can’t install this repo. When I use:

add https://github.com/Arpeggeo/julia-distributed-computing.git
     Cloning git-repo `https://github.com/Arpeggeo/julia-distributed-computing.git`
    Updating git-repo `https://github.com/Arpeggeo/julia-distributed-computing.git`
ERROR: expected a `name` entry in project file at `/tmp/jl_2cQn6c/Project.toml

So I don’t know why, but I can’t install this package.

Then you say I don’t have to clone the package… Then how do I use it? The provided script uses the data/ and result/ dirs.

Reading the README doesn’t give much, sorry…

The README teaches you how to do distributed computing. You don’t need to clone the repo, reproduce the example, or anything like that.

If you had read the README from beginning to end, you would have found the last section, which explains how to create submission scripts. It is super easy to search online for a SLURM submission script and reproduce the example.

I will let other people help you from here.

Stupid question: why are you using Julia 1.6.7?
Also, another point: above you say that packages need to be installed on the compute nodes.
You should have a shared directory which is mounted on all nodes. If a package is installed there, it should be available everywhere. Maybe I am misunderstanding this.
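
A minimal sketch of that shared-directory idea, assuming the project environment lives in the shared folder: the workers are started with --project pointing at that folder, so all of them resolve packages from the same Project.toml. The temporary directory below is only a stand-in for a path such as ~/shared/myproject (hypothetical), and the workers are local ones for demonstration.

```julia
using Distributed

# Stand-in for a project folder inside the shared directory (hypothetical).
project_dir = mktempdir()
touch(joinpath(project_dir, "Project.toml"))

# Start workers whose active project is the shared project folder.
pids = addprocs(2; exeflags = `--project=$project_dir`)

# Ask each worker which project file it has active.
worker_projects = [remotecall_fetch(Base.active_project, p) for p in pids]
foreach(println, worker_projects)

# Remove the demonstration workers again.
rmprocs(pids)
```

Note that the packages themselves are installed into the Julia depot (~/.julia by default), so that directory, or whatever JULIA_DEPOT_PATH points to, would also have to be visible from every node.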