Trivial question about workers on a cluster

Good day everyone!

As the title suggests, I have a (probably trivial) question about how to run a parallel computation on multiple nodes in a cluster.

My code calls an external library, written in C, and is parallelized across multiple processes. When I launch it with the following commands:

(…)
export JULIA_LOAD_PATH=(directory in which my julia interface is defined)
export LD_LIBRARY_PATH=(directory in which my C library is defined)

./julia -p (number of processes) ./myprogram.jl etc …
(…)

my code works perfectly, both on the cluster and on my laptop. But now I would like to use more cores than a single node provides, that is, I would like to run the code on multiple nodes of the cluster I am using.

After specifying, in the .sh file, that I want to launch my code on multiple nodes of the cluster (setting the number of CPUs etc.), I immediately noticed that, with the commands:

(…)
export JULIA_LOAD_PATH=(directory in which my julia interface is defined)
export LD_LIBRARY_PATH=(directory in which my library is defined)

srun hostname -s > hostfile

./julia --machine-file ./hostfile ./myprogram.jl etc.
(…)

the same code doesn’t work, because the workers can’t find the package that the “myprogram.jl” program loads with:

@everywhere using Mypackage

In fact the output I get is:

ERROR: LoadError: On worker 2:
ArgumentError: Package Mypackage [top-level] is required but does not seem to be installed:

  • Run Pkg.instantiate() to install all recorded dependencies.
    (…)

This also occurs if I launch the code on a single node with the above syntax, i.e. using --machine-file etc., so the problem is definitely in the way I am telling julia to launch the processes and how (and from where) to load packages. Clearly I have to do something different, but even after reading the documentation (as well as other similar discussions) I don’t understand exactly what.

Can anyone tell me exactly where I’m going wrong and what I should do?

Thanks everyone in advance for your help!

I think this might be related to your use of environment variables. These are set by your top-level bash script and therefore seen by your “master” julia process, but not propagated to “worker” processes which are started over ssh.
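To illustrate what I mean, here is a minimal sketch you can run on any machine. It uses `env -i` only as a stand-in for "a process that did not inherit your script's exports"; a real ssh session gets the remote login environment rather than an empty one, but the effect on your exported variables should be similar. The path is a placeholder, not your actual directory.

```shell
#!/bin/bash
# Placeholder value, for illustration only
export LD_LIBRARY_PATH="/path/to/shared/library"

bash_bin=$(command -v bash)

# A child of this shell inherits the exported variable...
inherited=$("$bash_bin" -c 'echo "${LD_LIBRARY_PATH:-unset}"')

# ...but a process started without this shell's environment does not.
# (env -i is an assumption-laden stand-in for a worker started over ssh.)
clean=$(env -i "$bash_bin" -c 'echo "${LD_LIBRARY_PATH:-unset}"')

echo "child process:     $inherited"
echo "clean environment: $clean"
```

If the second line reports the variable as unset, that matches the situation I suspect your workers are in.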

In any case, I think you should avoid fiddling with the JULIA_LOAD_PATH environment variable: just use the package manager instead, asking it to activate the environment of your project.

Your shell-script would look like:

#!/bin/bash

srun hostname -s > hostfile
julia --machine-file=hostfile /path/to/MyPackage/run.jl

(note that /path/to/MyPackage above refers to the path of the top-level directory of your package; this is where the Project.toml file should be located. For simplicity I’ll assume that the main julia program run.jl is also located in the same directory).
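An optional refinement of the hostfile step, as a sketch: `srun hostname -s` prints one line per task, and as far as I know the machine file also accepts a `count*host` form, so repeated host names can be collapsed. The node names below are made up so that the snippet runs without Slurm, and the temporary directory is only there to avoid touching your real hostfile.

```shell
#!/bin/bash
# Work in a scratch directory so nothing real gets overwritten
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for `srun hostname -s > hostfile.raw`:
# hypothetical node names, one line per task.
printf '%s\n' node01 node01 node02 node01 node02 > hostfile.raw

# Collapse repeated hosts into the `count*host` machine-file form.
sort hostfile.raw | uniq -c | awk '{ print $1 "*" $2 }' > hostfile

cat hostfile
```

Either form should start the same number of workers per node; the collapsed file is just easier to read when you have many tasks per node.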

Your main julia program would be in charge of activating the environment on all processes (including the workers), as well as setting the other needed environment variables:

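using Distributed

# Activate the environment in its own @everywhere block: @everywhere first runs
# the `using` statements found in its block on the master process, so
# `using MyPackage` should not share a block with the Pkg.activate it depends on.
@everywhere begin
  # Set other environment variables
  # (note: the dynamic loader usually reads this variable at process startup,
  # so setting it here may come too late for libraries loaded by this process)
  ENV["LD_LIBRARY_PATH"] = "/path/to/shared/library"

  # Take care of activating the correct environment
  using Pkg
  Pkg.activate(@__DIR__) # assuming run.jl is in the same directory as Project.toml
end

@everywhere using MyPackage # now this should work, because the correct environment is active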

# ...