Building cluster for Julia parallel computations

parallel

#1

Hi,
What is the best way to set up a cluster from local lab PCs for parallel Julia computations?
Our lab has a hybrid Windows-Linux environment dominated by Windows servers.
I have looked at ClusterManagers.jl and the MPI libraries, but ClusterManagers.jl seems to support only Linux environments, and MPIManager is said to support Julia 0.4 only.
Could you please give me some advice for this use case?


#2

This may not be the most elegant solution, but it worked for me in the past. I set up the ssh capability between the computers and added the procs like so:

addprocs(["myComputer1@IP1", "myComputer2@IP2", "myComputer3@IP3"], dir="/Applications/Julia-0.4.0.app/Contents/Resources/julia/bin/")

Each entry is the computer name followed by the IP address. dir points to the directory that contains the Julia executable. If it's in a different location on each computer, it might be possible to pass a vector of locations, but I have not tried that.
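
For anyone reading this on a current Julia 1.x, here is a rough sketch of the same ssh-based setup. The host names and install paths are made up for illustration. As I understand the Distributed documentation, `dir` sets the working directory on the worker and `exename` names the Julia executable. I don't know of a way to pass a vector of locations, so this sketch assumes one `addprocs` call per host instead, which lets each machine have its own path.

```julia
using Distributed

# Hypothetical hosts and install paths; replace them with real values.
workers_plan = [
    (host = "user1@192.168.0.11", exename = "/usr/local/bin/julia", dir = "/home/user1"),
    (host = "user2@192.168.0.12", exename = "/opt/julia/bin/julia", dir = "/home/user2"),
]

# One addprocs call per host, so each machine gets its own executable path
# and working directory. Returns the ids of all workers that were started.
function launch_all(plan; nworkers_per_host::Int = 1)
    pids = Int[]
    for w in plan
        append!(pids, addprocs([(w.host, nworkers_per_host)];
                               exename = w.exename, dir = w.dir))
    end
    return pids
end

# launch_all(workers_plan)   # needs passwordless ssh to every host
```

The call at the end is commented out because it only works once key-based ssh login to every listed host is in place.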


#3

Christopher, thanks!
I am trying to use my Linux machine as the master process and start workers on a remote Windows machine using the Bitvise SSH server. I can see that the ssh connection is established, but nothing is started. I have been trying different variants of the command syntax.

addprocs(["ivan@win-julia2"], dir="\Julia-0.5.2\bin",exename="julia.exe")

  <event seq="216" time="2017-05-23 00:58:15.282668 +0300" app="BvSshServer 7.31" name="I_EXECS_COMMAND_EXECUTED" desc="Command executed.">
    <session id="1024" service="SSH" remoteAddress="192.168.56.101:48922" windowsAccount="WIN-JULIA2\ivan"/>
    <channel type="session" id="1"/>
    <parameters command="cmd.exe /c sh -l -c &quot;cd 'Julia-0.5.2\bin' &amp;&amp; julia.exe --worker zT0oLTnPGTJNL85n&quot;" initDir="C:\Users\ivan" execRequest="sh -l -c &quot;cd 'Julia-0.5.2\bin' &amp;&amp; julia.exe --worker zT0oLTnPGTJNL85n&quot;"/>
  </event>

  <event seq="217" time="2017-05-23 00:58:15.321424 +0300" app="BvSshServer 7.31" name="I_CHANNEL_SESSION_CLOSED" desc="Session channel closed.">
    <session id="1024" service="SSH" remoteAddress="192.168.56.101:48922" windowsAccount="WIN-JULIA2\ivan"/>
    <channel type="session" id="1"/>
  </event>

  <event seq="218" time="2017-05-23 00:58:15.323793 +0300" app="BvSshServer 7.31" name="I_SESSION_DISCONNECTED_NORMALLY" desc="Session disconnected normally.">
    <session id="1024" service="SSH" remoteAddress="192.168.56.101:48922" windowsAccount="WIN-JULIA2\ivan"/>
    <parameters disconnectReason="SshError"/>
    <error type="Flow" component="SshManager/transport" class="RemoteSshDisconn" code="ByApplication" description="disconnected by user"/>
  </event>
</log>

#5

The remote workers are being launched through sh -l -c ..., which is Unix-specific. The launcher would need some patches to be able to work with Windows workers.
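
A note for readers on newer Julia versions: as far as I know, the ssh launcher in the Distributed standard library later gained a `shell` keyword (documented from around Julia 1.6), and setting it to `:wincmd` makes the launch command target cmd.exe instead of `sh`. It is not available in the 0.5.2 release discussed in this thread. A minimal sketch, reusing the host name from the post above; the install path and working directory are hypothetical:

```julia
using Distributed

# Keyword arguments for a Windows worker reached over ssh.
# The paths are hypothetical; `shell = :wincmd` assumes a recent Julia on the master.
win_opts = (
    shell   = :wincmd,
    exename = raw"C:\Julia\bin\julia.exe",
    dir     = raw"C:\Users\ivan",
)

# Start one worker on the given Windows host and return its worker id list.
launch_windows_worker(host) = addprocs([(host, 1)]; win_opts...)

# launch_windows_worker("ivan@win-julia2")  # needs a reachable ssh server on the Windows host
```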


#6

Thanks, exactly. Now I am thinking of using ClusterManagers.jl with HTCondor instead of ssh. However, condor.jl also requires customization to support Windows nodes.


#7

Finally I have installed HTCondor and I am trying to start remote workers through condor_submit. I have reconfigured condor.jl to run on Windows, but I have run into an error in the way workers are registered.

Workers are started on the remote computers with the following .cmd script:

cd C:\Julia-0.5.2\bin
julia.exe --worker IFiJUqV4N1MuC9Uo | telnet windows-master 8449

But after execution I can see the following in stderr:

ERROR: write: broken pipe (EPIPE)
 in yieldto(::Task, ::ANY) at .\event.jl:136
 in wait() at .\event.jl:169
 in stream_wait(::Task) at .\stream.jl:44
 in uv_write(::Base.PipeEndpoint, ::Ptr{UInt8}, ::UInt64) at .\stream.jl:820
 in unsafe_write(::Base.PipeEndpoint, ::Ptr{UInt8}, ::UInt64) at .\stream.jl:830
 in write(::Base.PipeEndpoint, ::Array{UInt8,1}) at .\io.jl:175
 in print at .\strings\io.jl:70 [inlined]
 in start_worker(::Base.PipeEndpoint, ::String) at .\multi.jl:1539
 in process_options(::Base.JLOptions) at .\client.jl:218
 in _start() at .\client.jl:321

Have you encountered something similar? Do you have any ideas how to fix it?
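
Some context on the trace: it ends in `start_worker`, at the point where the worker prints its connection details to stdout for the master to read. In the Base source of that era, as far as I recall, that line has the form `julia_worker:PORT#ADDRESS`. If the program on the other end of the pipe closes its input before the write completes, the print fails with EPIPE. Below is a minimal sketch of parsing such a line on the master side. `parse_worker_line` is a hypothetical helper, not part of Julia or ClusterManagers, and the address in the usage line is made up.

```julia
# Parse a worker announcement of the assumed form "julia_worker:PORT#ADDRESS".
# Returns (address, port), or nothing if the line does not match.
function parse_worker_line(line::AbstractString)
    m = match(r"^julia_worker:(\d+)#(.+)$", strip(line))
    m === nothing && return nothing
    return (String(m.captures[2]), parse(Int, m.captures[1]))
end

# Usage with a made-up address:
info = parse_worker_line("julia_worker:9009#192.168.56.102\n")
```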


#8

Ivborrisov, you have to ask yourself what the cost/benefit trade-off is here.
If you have to expend a lot of effort to get it running on Windows, is it worth it?
By that I mean:
how much would it cost, in money or effort, to get Linux servers,
or to rent them from a cloud provider?

To be more constructive: you have Windows servers, so run Linux VMs on those Windows servers and set up your cluster there!


#9

John, thanks!
I am finally using ClusterManagers.jl with HTCondor on my Windows servers. I had to customize condor.jl to support the Windows environment, but now it is working.


#10

Could you share what you did? This is the kind of thing that needs to be done once by someone very motivated. Glad to hear it worked out :slight_smile:


#11

Sure,
I have put my version of condor.jl here


I still have to make it more "prod-ready": right now it is more of a draft, but it is working (on 0.5.2).
Please see the requirements in the README.
I tried to use the default telnet client to send the IP and port details from the remote workers, but I got BROKEN PIPE messages. I didn't have much time to investigate, so I used ncat on the remote workers instead:
https://nmap.org/ncat
For the multithreaded case on remote workers, you can uncomment the line
println(scriptf, "set JULIA_NUM_THREADS=%NUMBER_OF_PROCESSORS%")
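
To illustrate how these pieces fit together, here is a rough sketch of generating the worker .cmd script with ncat in place of telnet. It is not the actual code from my condor.jl. `worker_script` and its keyword names are hypothetical, the values are the ones from the earlier post, and it is written for current Julia 1.x (raw strings) rather than 0.5.2.

```julia
# Build the text of a Windows .cmd script that starts one worker and pipes
# its connection details to the master through ncat.
function worker_script(; julia_bin::AbstractString, cookie::AbstractString,
                       master::AbstractString, port::Integer, threads::Bool = false)
    io = IOBuffer()
    println(io, "cd ", julia_bin)
    # Optional: give each worker as many threads as the machine has processors.
    threads && println(io, "set JULIA_NUM_THREADS=%NUMBER_OF_PROCESSORS%")
    println(io, "julia.exe --worker ", cookie, " | ncat ", master, " ", port)
    return String(take!(io))
end

script = worker_script(julia_bin = raw"C:\Julia-0.5.2\bin",
                       cookie = "IFiJUqV4N1MuC9Uo",
                       master = "windows-master", port = 8449, threads = true)
print(script)
```

The `--worker COOKIE` form mirrors the 0.5-era command line shown earlier in the thread; newer Julia versions may expect a different form, so check `julia --help` for your version.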