I know I can run julia -p 5, but even if Julia can’t see the other nodes it will still launch 5 workers, all sharing the CPU of a single machine.
For now the main task is to understand whether Julia can actually see other nodes or not.
I could probably check it by the IP address of each worker, but I can’t find information on how to do that.
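One way to check where each worker actually runs is to ask every worker for the hostname of its machine. The following is only a minimal sketch: it uses plain addprocs(2), which starts local workers, so here every worker reports the same host. The variable name hosts is just an illustration.

```julia
using Distributed

# Start two workers; with plain addprocs(n) they all run on the local machine.
addprocs(2)

# Ask every worker for the hostname of the machine it is running on.
hosts = Dict(p => remotecall_fetch(gethostname, p) for p in workers())

for p in workers()
    println("worker $p runs on $(hosts[p])")
end

# More than one distinct hostname means the workers really are on several nodes.
println("distinct hosts: ", length(Set(values(hosts))))
```

If an IP address is preferred over a hostname, getipaddr from the Sockets standard library could be fetched from each worker in the same way, assuming Sockets is loaded on the workers.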
Are you using ClusterManagers.jl as explained in the resource above, or creating the submission scripts yourself? Which resource manager is installed on your cluster? Most Linux clusters I have worked with use SLURM.
I hope it is working correctly; it took me 3 days to solve this.
But then I return to Julia.
First of all I must be sure that Julia can see all nodes (do I have to launch Julia on all nodes, or is it enough that the Julia bin directory is in the PATH env var?).
I installed ClusterManagers.jl on all nodes and, following its example, I ran:
using Distributed, ClusterManagers
addprocs(SlurmManager(2), partition="debug", t="00:5:00")
and then Julia hangs on:
connecting to worker 1 out of 2
srun: error: cl1cfqj5m64qsem9ra43-ecaj.ru-central1.internal: tasks 0-1: Exited with exit code 1
Can you try the tutorial I shared without ClusterManagers.jl first? Just create a SLURM submission script and see if it works as expected. If it doesn’t, then other people can help with setting up SLURM and Julia correctly.
To run on multiple nodes I guess I either have to prepare a SLURM script or slightly modify the Julia script to add workers first, right?
As I don’t know yet how to write a SLURM script, I prefer to call addprocs(2) in the Julia script.
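Note that addprocs(2) on its own only starts workers on the machine where the script runs. To reach other nodes without SLURM, addprocs also accepts a list of machines and starts the workers over SSH. Below is a sketch under assumptions: the hostnames node1 and node2 and the USE_REMOTE_NODES switch are hypothetical, and the remote branch requires passwordless SSH plus a Julia executable reachable on each node.

```julia
using Distributed

# Hypothetical hostnames; replace them with the real names of the compute nodes.
remote_nodes = ["node1", "node2"]

# One worker per node, given as (host, count) tuples.
machines = [(host, 1) for host in remote_nodes]

# USE_REMOTE_NODES is a made-up switch for this sketch, not a Julia setting.
if get(ENV, "USE_REMOTE_NODES", "0") == "1"
    # Starts the workers over passwordless SSH on the listed machines.
    addprocs(machines)
else
    # Starts two workers on the local machine only.
    addprocs(2)
end

println("number of workers: ", nworkers())
```

Whichever branch is used, the rest of the script (for example @everywhere and pmap calls) stays the same, so the local variant is a convenient way to test before moving to the cluster.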
But my nodes don’t have the CSV and ProgressMeter packages. How can I install these packages on all nodes?
The README teaches you how to do distributed computing. You don’t need to clone the repo, reproduce the example, or anything like that.
If you had read the README from beginning to end you would find out the last section explaining the creation of submission scripts. It is super easy to search online for SLURM submission script and reproduce the example.
Stupid question - why are you using Julia 1.6.7?
Also, another point - above you say that packages need to be installed on the compute nodes.
You should have a shared directory which is mounted on all nodes. If a package is installed there, it should be available everywhere. Maybe I am misunderstanding this.
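Whether the package directory is really shared can be checked from Julia by asking each worker for its hostname, its first depot directory, and whether it can locate the packages. This is only a sketch: package_report is a hypothetical helper name, the workers here are local ones, and Base.find_package is not part of the exported API, so treat its use as an assumption.

```julia
using Distributed

addprocs(2)  # local workers here; on a cluster these would sit on the compute nodes

# Packages mentioned in the thread.
pkgs = ["CSV", "ProgressMeter"]

# Hypothetical helper, defined on every process: reports where the process runs,
# which depot it uses, and which of the given packages it can locate.
@everywhere function package_report(names)
    return (host = gethostname(),
            depot = first(DEPOT_PATH),
            found = Dict(n => (Base.find_package(n) !== nothing) for n in names))
end

for p in workers()
    r = remotecall_fetch(package_report, p, pkgs)
    println("worker $p on $(r.host): depot = $(r.depot), found = $(r.found)")
end
```

If every worker reports the same depot path on a shared filesystem, installing a package once from any node should make it visible to all of them.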