Julia can be started in parallel mode with either the -p or the --machinefile options. -p n will launch an additional n worker processes, while --machinefile file will launch a worker for each line in file file. The machines defined in file must be accessible via a passwordless ssh login, with Julia installed at the same location as the current host. Each machine definition takes the form [count*][user@]host[:port] [bind_addr[:port]]. user defaults to current user, port to the standard ssh port. count is the number of workers to spawn on the node, and defaults to 1. The optional bind-to bind_addr[:port] specifies the IP address and port that other workers should use to connect to this worker.
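To make the machine definition format concrete, here is a small sketch that splits one line into its parts. It is only an illustration of the format quoted above, not Julia's own parsing code; the function name `parse_machine_line` and the host names are made up, and it assumes Julia 1.x.

```julia
# Illustrative only: split one machinefile line of the form
#   [count*][user@]host[:port] [bind_addr[:port]]
# into its parts. `parse_machine_line` is a hypothetical helper.
function parse_machine_line(line::AbstractString)
    pattern = r"^(?:(\d+)\*)?(?:([^@\s]+)@)?([^:\s]+)(?::(\d+))?(?:\s+(\S+))?$"
    m = match(pattern, strip(line))
    m === nothing && error("unrecognized machine line: $line")
    n = m.captures[1] === nothing ? 1 : parse(Int, m.captures[1])  # count defaults to 1
    return (count = n, user = m.captures[2], host = m.captures[3],
            port = m.captures[4], bind = m.captures[5])
end

# Hypothetical hosts, for illustration.
println(parse_machine_line("2*alice@node1:2222 10.0.0.5:9000"))
println(parse_machine_line("node2"))
```

Note that the count is attached to the host with `*`, while a space separates the host from the optional bind address.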
Thanks, salchipapa. It doesn’t like me. Julia is installed at exactly the same location on both machines.
$ julia --machinefile machinefile
ssh: Could not resolve hostname 5: nodename nor servname provided, or not known
ERROR: Unable to read host:port string from worker. Launch command exited with error?
Stacktrace:
read_worker_host_port(::Pipe) at ./distributed/cluster.jl:236
connect(::Base.Distributed.SSHManager, ::Int64, ::WorkerConfig) at ./distributed/managers.jl:391
create_worker(::Base.Distributed.SSHManager, ::WorkerConfig) at ./distributed/cluster.jl:443
setup_launched_worker(::Base.Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at ./distributed/cluster.jl:389
(::Base.Distributed.##33#36{Base.Distributed.SSHManager,WorkerConfig,Array{Int64,1}})() at ./task.jl:335
Reading the docs, it seems that all I need to do is put this into the machinefile:
5 164.67.165.22
and I should be ready to go. (The IP is made up.) Is this the correct format for starting 5 processes on 164.67.165.22?
Password-less and username-less ssh works just fine:
> ssh 164.67.165.22 '/Applications/Julia-0.6.app/Contents/Resources/julia/bin/julia -e "println(\"hello\")"'
hello
back to home
Does it need another open port? Anything else? What is the simplest way to find out what this error means?
Great, this works. It is incompatible with Julia's -p switch, but works fine without it. I can add not only localhost, but also call plain addprocs(). So I am pretty much all set. Thank you.
PS: If someone has a working plain machinefile, I am curious what it should have looked like.
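For what it's worth, going by the documented form `[count*][user@]host[:port]`, the count is joined to the host with `*` rather than a space. With a space, the leading `5` appears to be taken as the host name, which would match the `Could not resolve hostname 5` message. A minimal sketch that writes such a machinefile, assuming Julia 1.x; the temporary path is hypothetical and the address is the made-up one from the post:

```julia
# Write a one-line machinefile asking for five workers on one host.
# The directory is a throwaway temp dir; the IP is the made-up one from the post.
machinefile = joinpath(mktempdir(), "machinefile")
write(machinefile, "5*164.67.165.22\n")   # count*host
print(read(machinefile, String))

# Start Julia with:  julia --machinefile <path to that file>
# From an already running session, the same request would be:
#   using Distributed
#   addprocs([("164.67.165.22", 5)])
```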
No. You can use the keyword arguments exename to specify the julia binary location (and the environment to load) and dir to specify the working directory. I often start workers on a small Linux cluster from a Windows machine.
Is it possible to let Julia work out automatically the available processes on a pool of machines instead of explicitly stating n processes on MachineA? For example, we specify that we want 64 processes and tell Julia that we have Machine[A-Z], and then it’s up to Julia which CPUs to use.
Also, is there any performance difference between using SSHManager, i.e. explicitly stating the nodes as in your tutorial, and using ClusterManagers (my cluster needs PBSManager)? The reason I ask is that I can’t get PBSManager working, since there is no clear example of what’s needed (I opened an issue here).
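On the first question, one possible approach is to build the list of `(host, count)` pairs yourself and pass it to `addprocs`. The sketch below splits a total evenly over a list of hosts; `spread_workers` and the host names are hypothetical. The `addprocs` documentation also describes passing `:auto` as the count, which launches as many workers as there are CPU threads on that host. Neither option looks at the current load of the machines.

```julia
# Hypothetical helper: split `total` workers as evenly as possible over `hosts`
# and return (host, count) pairs in the shape addprocs accepts.
function spread_workers(hosts::Vector{String}, total::Int)
    base, extra = divrem(total, length(hosts))
    specs = [(h, base + (i <= extra ? 1 : 0)) for (i, h) in enumerate(hosts)]
    return filter(s -> s[2] > 0, specs)   # drop hosts that would get no worker
end

hosts = ["machine" * string(c) for c in 'A':'Z']   # made-up host names
specs = spread_workers(hosts, 64)
println(sum(last, specs))                          # total number of workers requested

# Alternative: let each host decide, one worker per CPU thread.
auto_specs = [(h, :auto) for h in hosts]

# With reachable hosts and passwordless ssh, either list could be used as:
#   using Distributed
#   addprocs(specs)        # or addprocs(auto_specs)
```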
I am trying to connect from my Windows computer to several Linux computers via ssh.
It does not work if I use the REPL started from the Julia directory. However, it does work if I first start Julia from Git Bash on Windows and then add the workers. Has anyone experienced the same behaviour? Is there a reason for it, and can it be fixed? I would like to use the REPL in Atom (where it also does not work).
The code I am using is
using Distributed
addprocs([("john.doe@linuxcomputer", 3)]; dir="/home/john.doe/julia-project", exename="/home/john.doe/software/julia/julia-1.3.1/bin/julia", tunnel=true)
Note: I have activated passwordless ssh login beforehand.
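One thing that may differ between Git Bash and the plain REPL or Atom on Windows is the PATH that the Julia process inherits, and therefore whether an `ssh` executable can be found at all. That is only a guess at the cause, not a confirmed diagnosis. A quick check to run in each of the three sessions and compare:

```julia
# Check whether this Julia session can find an ssh executable on its PATH.
# Sys.which returns the full path, or nothing if no executable is found.
ssh_path = Sys.which("ssh")
if ssh_path === nothing
    println("no ssh executable found on PATH")
else
    println("ssh found at ", ssh_path)
end
```

If the session where addprocs fails reports no ssh while the Git Bash session finds one, adding the directory that contains ssh to the Windows PATH would be the first thing to try.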