Multiple Computer Example

Does anybody have a really simple example of a local host farming out a task to another ssh-connected computer (with julia on it)?

I'm thinking of something like this:

@everywhere function abc(n::Int)
      sum=0; for i=1:n; sum+=i; end;#for
      return ( readstring(`hostname`), sum )
end#function

hosts = [ "localhost", "friend.ucla.edu" ]

println( pmap( i -> abc(i), 1:1000, hosts ) )

The docs look a bit overwhelming on the subject.

/iaw

From the Julia manual:

Julia can be started in parallel mode with either the -p or the --machinefile options. -p n will launch an additional n worker processes, while --machinefile file will launch a worker for each line in file file. The machines defined in file must be accessible via a passwordless ssh login, with Julia installed at the same location as the current host. Each machine definition takes the form [count*][user@]host[:port] [bind_addr[:port]]. user defaults to current user, port to the standard ssh port. count is the number of workers to spawn on the node, and defaults to 1. The optional bind-to bind_addr[:port] specifies the ip-address and port that other workers should use to connect to this worker.
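
To make the machine definition format above concrete, here is a minimal sketch that splits one line into its parts. `parse_machine_spec` is a hypothetical helper written for this illustration, not part of Julia, and the regex is only my reading of the quoted format. Read literally, the format puts a `*` between count and host, so a line like `5 164.67.165.22` would be read as host `5` plus a bind address, which is at least consistent with the `Could not resolve hostname 5` error reported further down.

```julia
# Hypothetical helper (not part of Julia): splits one machinefile line following
# "[count*][user@]host[:port] [bind_addr[:port]]" as quoted above.
function parse_machine_spec(line::AbstractString)
    fields = split(strip(line))   # whitespace separates the host part from the bind part
    isempty(fields) && error("empty machine spec")
    m = match(r"^(?:(\d+)\*)?(?:([^@\s]+)@)?([^:\s]+)(?::(\d+))?$", fields[1])
    m === nothing && error("cannot parse machine spec: $line")
    n = m[1] === nothing ? 1 : parse(Int, m[1])            # count defaults to 1
    bind = length(fields) > 1 ? String(fields[2]) : nothing
    return (count = n, user = m[2], host = String(m[3]), port = m[4], bind = bind)
end

println(parse_machine_spec("5*164.67.165.22"))   # count 5 on host 164.67.165.22
println(parse_machine_spec("5 164.67.165.22"))   # host "5", bind address 164.67.165.22
```

Under this reading, the line for five workers on one host would be `5*164.67.165.22`.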

thanks salchipapa. it doesn’t like me. the installation is exactly at the same spot in both machines.

$ julia --machinefile machinefile
ssh: Could not resolve hostname 5: nodename nor servname provided, or not known
ERROR: Unable to read host:port string from worker. Launch command exited with error?
read_worker_host_port(::Pipe) at ./distributed/cluster.jl:236
connect(::Base.Distributed.SSHManager, ::Int64, ::WorkerConfig) at ./distributed/managers.jl:391
create_worker(::Base.Distributed.SSHManager, ::WorkerConfig) at ./distributed/cluster.jl:443
setup_launched_worker(::Base.Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at ./distributed/cluster.jl:389
(::Base.Distributed.##33#36{Base.Distributed.SSHManager,WorkerConfig,Array{Int64,1}})() at ./task.jl:335
Stacktrace:

reading the docs, it seems like all I need to do is to put into the machinefile

5  164.67.165.22

and I should be ready to go. (IP was made up.) is this the correct format (5 processes to be started on 164.67.165.22)?

password-less and username-less ssh works just fine:

> ssh 164.67.165.22 '/Applications/Julia-0.6.app/Contents/Resources/julia/bin/julia -e "println(\"hello\")"'
hello

back to home

does it need another open port? anything else? what is the simplest way to check what this means?

regards,

/iaw

Try something like this for now:

julia> addprocs([("machine1", 2), ("machine2", 1)])

This will launch 2 workers on machine1 and 1 worker on machine2.
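
For completeness, a minimal end-to-end sketch of the original question built on `addprocs`, assuming Julia 1.x (where `using Distributed` is required). Two local workers stand in for the remote machines so that it runs anywhere; the hostname-plus-sum function is adapted from the first post, with `gethostname()` swapped in for the shell call.

```julia
using Distributed

# Assumption: two local workers stand in for remote ones so the sketch runs
# anywhere. For SSH hosts, pass a vector of (host, count) tuples instead, e.g.
# addprocs([("machine1", 2), ("machine2", 1)]).
addprocs(2)

# Define the function on every process, workers included.
@everywhere function abc(n::Int)
    total = 0
    for i in 1:n
        total += i
    end
    return (gethostname(), total)   # report which host did the work
end

# pmap hands each element of 1:10 to whichever worker is free.
results = pmap(abc, 1:10)
println(results)
```

Note that `pmap` takes no host list: the workers are registered once with `addprocs`, and `pmap` then uses whatever workers exist.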

great. this works. it is incompatible with the -p julia switch, but works fine without it. I can add not only localhost, but do plain addprocs(), too. so I am pretty much all set. thank you.

PS: If someone has a working plain machinefile, I am curious what it should have looked like.

I remember now (I haven’t used it in a while): it seems that either the documentation is wrong or that option is not working as intended. Try:

machinefile:

164.67.165.22
164.67.165.22
164.67.165.22
164.67.165.22
164.67.165.22

instead of:

5 164.67.165.22

Then:

julia -p 5 --machinefile machinefile

That worked for me last time! This should give you 10 workers: 5 local and 5 remote.
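
If you would rather not type the repeated lines by hand, a small sketch like the following can generate that layout. `write_machinefile` is a hypothetical helper named for this illustration, and the temporary path is only a placeholder; whether Julia then accepts the file is exactly what is being discussed here.

```julia
# Hypothetical helper: write one host line per desired worker, the layout
# reported to work above. Names and paths are illustrative only.
function write_machinefile(path::AbstractString, hosts::Vector{Pair{String,Int}})
    open(path, "w") do io
        for (host, n) in hosts, _ in 1:n
            println(io, host)   # one line per worker on this host
        end
    end
    return path
end

path = write_machinefile(joinpath(mktempdir(), "machinefile"), ["164.67.165.22" => 5])
print(read(path, String))
```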

the machinefile version does not work on my end, at all.

bash$ \julia -p 3 --machinefile machinefile
ERROR: connect: connection refused (ECONNREFUSED)ERROR: ERROR:
Stacktrace:connect: connection refused (ECONNREFUSED)connect: connection refused (ECONNREFUSED)

Stacktrace:
Stacktrace: [1] try_yieldto
( [1] try_yieldto
( [1] try_yieldto::(Base.##296#297{Task}, ::Base.##296#297{Task}, ::Base.##296#297{Task}, ::Task):: at Task./event.jl:189)
 at  [2] ./event.jl:189wait
( [2] )wait at ::(./event.jl:234Task)) at

and tons more output. the addprocs works fine.

regards,

/iaw

Did I understand correctly that the file system on each remote machine must

  • have the exact same path to julia’s executable
  • have the exact same folder path as the current path on the “calling” machine?

This makes it impossible to distribute jobs between, say, a Linux box, a Mac, and a Windows PC. Could this be intended? Am I wrong?

No. You can use the keyword arguments exename to specify the julia binary location (and the environment to load) and dir to specify the working directory. I often start workers on a small Linux cluster from a Windows machine.

params = (exename = `/path/to/julia/bin/julia --project=/path/to/project`, dir = "/path/to/working/directory")
addprocs([("some_remote_machine", 5)]; params...)

This will start 5 processes on some_remote_machine.
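
To see the two keywords in action without a remote machine, here is a minimal sketch, assuming Julia 1.x, that starts one local worker with an explicit `exename` and `dir` and then asks the worker where it is running. The local worker and the temporary directory are stand-ins chosen for this illustration; for a real remote host you would pass the machine tuple and paths that exist on that machine.

```julia
using Distributed

# Assumption: a local worker stands in for the remote machine. For a remote
# host, replace `1` with [("some_remote_machine", 1)] and use paths that
# exist on that machine.
julia_bin = joinpath(Sys.BINDIR, Base.julia_exename())  # binary used to start the worker
workdir = realpath(mktempdir())                          # directory the worker starts in

pids = addprocs(1; exename = julia_bin, dir = workdir)

# Ask the worker for its working directory to confirm that `dir` took effect.
worker_dir = remotecall_fetch(pwd, pids[1])
println(worker_dir)
```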

For more information see, for example, my parallel computing tutorial here, in particular the “THP cluster” part.

Thank you so much, particularly for the tutorial!

Is it possible to let Julia work out the available processes on a pool of machines automatically, instead of explicitly stating n processes on MachineA? For example, we specify that we want 64 processes and tell Julia that we have Machine[A-Z], and then it’s up to Julia which CPUs to use.

Also, is there any performance difference between using SSHManager, i.e. explicitly stating the nodes as in your tutorial, and using ClusterManagers (my cluster would use PBSManager)? I ask because I can’t get PBSManager working, since there is no clear example of what’s needed (I opened an issue here).

I am trying to connect from my Windows computer to several Linux computers via ssh.

It does not work if I use the REPL from the julia directory. However, it does work if I first start julia from the git bash on Windows and then add the workers. Does anyone experience the same behaviour and is there a reason for this or can it be fixed? I would like to use the REPL in atom (from where it also does not work).

The code I am using is

using Distributed
addprocs([("john.doe@linuxcomputer", 3)]; dir="/home/john.doe/julia-project", exename="/home/john.doe/software/julia/julia-1.3.1/bin/julia", tunnel=true)

Note: I have activated passwordless ssh login beforehand.
