Adding remote workers on Windows

On Julia v0.4 I was able to add remote workers on a Windows network using ssh (Bitvise).
I customised the startup command so that it connects through that specific ssh client.
However, it no longer works on Julia v0.5.
I do not expect anyone to help directly, but rather would like some advice on how to proceed to further diagnose where the problem lies.

I can start a remote Julia and execute a script.
From the Windows command terminal, the following works (here versioninfo.jl just contains versioninfo()):

C:\Users\Greg>sexec RemoteHost "C:\Program Files\Julia-0.5.0\bin\julia.exe" "C:\Users\Greg\versioninfo.jl"
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
  System: NT (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM)2 Duo CPU     E6750  @ 2.66GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Core2)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, core2)
C:\Users\Greg>

However, I can’t seem to successfully start a remote worker.
Again from the Windows command line:

C:\Users\Greg>sexec RemoteHost "C:\Program Files\Julia-0.5.0\bin\Julia.exe" --worker vqtjHYxSVjEOVVF6
julia_worker:9009#192.168.1.112
Master process (id 1) could not connect within 60.0 seconds.
exiting.

It seems the worker is spawned and reports its port and IP address,
but it eventually times out waiting for the master to connect.
Is there some kind of handshake not being completed successfully?

Are there any tips on how to further diagnose the problem?

Hello Greg,
I am starting a remote Windows worker from a Linux machine, also using Bitvise.
I am facing almost the same error. Have you solved this issue?

Yes, my particular setup is working now.

I think there were many issues. Running remote Julia workers on Windows in a corporate environment presents a plethora of pitfalls.

In particular:
Windows Firewall might be blocking the connection. (You can check in Control Panel: Allow an app or feature through Windows Firewall.) This should not affect you, though, if you are connecting from a Linux master.

Also check that there are no proxy or network policy issues.

I also suggest using Bitvise Client to connect to each server the first time. On the first connection to a server, a host key/fingerprint is created via a user dialog (which won’t be completed successfully if you connect a Julia worker in the background)
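
A rough way to test the firewall point from the master machine: take the line the worker prints on startup (julia_worker:9009#192.168.1.112 in the first post) and attempt a plain TCP connection to that address and port while the worker is still waiting. The sketch below assumes Julia 1.x, where Sockets is a standard library; parse_worker_line and can_connect are made-up helper names, not part of Julia.

```julia
using Sockets

# Parse the line a worker prints on startup,
# e.g. "julia_worker:9009#192.168.1.112".
function parse_worker_line(line::AbstractString)
    m = match(r"^julia_worker:(\d+)#(.+)$", strip(line))
    m === nothing && error("not a worker announcement: $line")
    return (host = String(m.captures[2]), port = parse(Int, m.captures[1]))
end

# Try to open a TCP connection from this machine to the worker's port.
# Returns true if the connection was accepted, false on any error.
function can_connect(host::AbstractString, port::Integer)
    try
        sock = connect(host, port)
        close(sock)
        return true
    catch
        return false
    end
end

# Usage, run on the master while the worker is still waiting:
# info = parse_worker_line("julia_worker:9009#192.168.1.112")
# can_connect(info.host, info.port)
```

If can_connect returns false, or hangs (connect has no timeout in this sketch), something between the master and the worker is blocking the port. It only checks reachability and does not perform the real master/worker handshake.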

Not sure what else I can say.

Greg, thanks!
In my case the ssh connection seems to be established, and according to the logs the command is executed:

However nothing is started. Maybe my command syntax is wrong…
addprocs(["ivan@win-julia2"], dir="\Julia-0.5.2\bin", exename="julia.exe")

I have also stopped the firewall on the Windows machine.

Sorry, the log hadn't uploaded. Here it is:

  <event seq="216" time="2017-05-23 00:58:15.282668 +0300" app="BvSshServer 7.31" name="I_EXECS_COMMAND_EXECUTED" desc="Command executed.">
    <session id="1024" service="SSH" remoteAddress="192.168.56.101:48922" windowsAccount="WIN-JULIA2\ivan"/>
    <channel type="session" id="1"/>
    <parameters command="cmd.exe /c sh -l -c &quot;cd 'Julia-0.5.2\bin' &amp;&amp; julia.exe --worker zT0oLTnPGTJNL85n&quot;" initDir="C:\Users\ivan" execRequest="sh -l -c &quot;cd 'Julia-0.5.2\bin' &amp;&amp; julia.exe --worker zT0oLTnPGTJNL85n&quot;"/>
  </event>

  <event seq="217" time="2017-05-23 00:58:15.321424 +0300" app="BvSshServer 7.31" name="I_CHANNEL_SESSION_CLOSED" desc="Session channel closed.">
    <session id="1024" service="SSH" remoteAddress="192.168.56.101:48922" windowsAccount="WIN-JULIA2\ivan"/>
    <channel type="session" id="1"/>
  </event>

  <event seq="218" time="2017-05-23 00:58:15.323793 +0300" app="BvSshServer 7.31" name="I_SESSION_DISCONNECTED_NORMALLY" desc="Session disconnected normally.">
    <session id="1024" service="SSH" remoteAddress="192.168.56.101:48922" windowsAccount="WIN-JULIA2\ivan"/>
    <parameters disconnectReason="SshError"/>
    <error type="Flow" component="SshManager/transport" class="RemoteSshDisconn" code="ByApplication" description="disconnected by user"/>
  </event>
</log>

This doesn’t look right. Standard Windows doesn’t have sh:

command="cmd.exe /c sh -l -c &quot;cd 'Julia-0.5.2\bin' &amp;&amp; julia.exe --worker zT0oLTnPGTJNL85n&quot;" 

You’ll want to see something like this in the log file (where julia.exe is called directly from cmd):

command="cmd.exe /c c:\...\Julia-0.5.2\bin\julia.exe --worker zT0oLTnPGTJNL85n"

You might need to customise the Julia command used to start workers. The default SSHManager (see launch_on_machine(manager::SSHManager, ...) in managers.jl) launches with:

cmd = `ssh -T -a -x -o ClearAllForwardings=yes -n $sshflags $host $(Base.shell_escape(cmd))`

but you’ll want something like:

cmd = `sexec $host $exename $exeflags`

You might want to copy the default SSHManager and create a custom manager. You will probably need to experiment a little with the actual command to get authentication etc. right.
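
As a starting point, a custom manager along those lines might look roughly like the sketch below. It is written against the ClusterManager interface of the Distributed standard library in Julia 1.x (on 0.5 the equivalent interface lives in Base and details differ), and it is untested against Bitvise. SexecManager and worker_command are hypothetical names. The sexec invocation simply mirrors the command line from the first post, and the sketch assumes authentication is already handled on the Bitvise side.

```julia
using Distributed

# Hypothetical manager that starts one worker per host through Bitvise's sexec.
struct SexecManager <: ClusterManager
    hosts::Vector{String}
end

# Build the remote command. The cookie is passed on the command line so that
# nothing has to be written to the worker's stdin through sexec.
function worker_command(host, exename, exeflags, cookie)
    return `sexec $host $exename $exeflags --worker=$cookie`
end

function Distributed.launch(manager::SexecManager, params::Dict,
                            launched::Array, c::Condition)
    for host in manager.hosts
        cmd = worker_command(host, params[:exename], params[:exeflags],
                             Distributed.cluster_cookie())
        proc = open(detach(cmd), "r")   # capture the worker's stdout
        wconfig = WorkerConfig()
        wconfig.io = proc.out           # master reads "julia_worker:port#ip" here
        # wconfig.host is left unset, so the master connects to the address
        # announced by the worker itself.
        push!(launched, wconfig)
        notify(c)
    end
end

# No per-worker bookkeeping in this sketch.
function Distributed.manage(manager::SexecManager, id::Integer,
                            config::WorkerConfig, op::Symbol)
    return nothing
end

# Usage (exename is a placeholder for the path of julia.exe on the remote machine):
# addprocs(SexecManager(["RemoteHost"]); exename = raw"C:\path\to\julia.exe")
```

Keeping the command construction in its own function makes it easy to print the exact command and compare it with what shows up in the Bitvise server log.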

Greg, thanks! Exactly. Now I am thinking of using either a customized ssh manager or ClusterManagers with HTCondor instead of ssh. However, condor.jl also requires customization to support Windows nodes.

I have now installed HTCondor and am trying to start remote workers through condor_submit. I have reconfigured condor.jl to run on Windows, but I have encountered an error in the way workers are registered.

Workers are started on remote computers with the following .cmd file:

cd C:\Julia-0.5.2\bin
julia.exe --worker IFiJUqV4N1MuC9Uo | telnet windows-master 8449

But after execution I see the following in stderr:

ERROR: write: broken pipe (EPIPE)
 in yieldto(::Task, ::ANY) at .\event.jl:136
 in wait() at .\event.jl:169
 in stream_wait(::Task) at .\stream.jl:44
 in uv_write(::Base.PipeEndpoint, ::Ptr{UInt8}, ::UInt64) at .\stream.jl:820
 in unsafe_write(::Base.PipeEndpoint, ::Ptr{UInt8}, ::UInt64) at .\stream.jl:830
 in write(::Base.PipeEndpoint, ::Array{UInt8,1}) at .\io.jl:175
 in print at .\strings\io.jl:70 [inlined]
 in start_worker(::Base.PipeEndpoint, ::String) at .\multi.jl:1539
 in process_options(::Base.JLOptions) at .\client.jl:218
 in _start() at .\client.jl:321

Do you have any ideas how to fix it?

Can’t help with HTCondor, just looked it up, sounds great.
Let me know if you get it working on Windows, I’d be very interested.

Greg,
finally my setup is working. It is a Windows environment with HTCondor as the job scheduler, and Julia submits requests to Condor to start workers.
The default condor.jl is written for a Linux environment, but I have changed the paths and notation and now it works with Windows.
I had problems using the telnet client to send the workers' response to the master process. The default is:

cd C:\Julia-0.5.2\bin
julia.exe --worker IFiJUqV4N1MuC9Uo | telnet windows-master 8449

I used ncat instead:

julia.exe --worker IFiJUqV4N1MuC9Uo | ncat windows-master 8449
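
For anyone debugging the same pipe, it can help to take HTCondor out of the picture and check only that something piped into ncat on the worker machine reaches the master. The sketch below assumes Julia 1.x and the Sockets standard library. read_first_line is a made-up helper, and port 8449 is just the one from the command above. It listens once and returns the first line it receives.

```julia
using Sockets

# Accept a single connection on `port` and return the first line the peer sends.
function read_first_line(port::Integer)
    server = listen(IPv4(0), port)      # IPv4(0) is 0.0.0.0, i.e. all interfaces
    try
        sock = accept(server)
        try
            return readline(sock)
        finally
            close(sock)
        end
    finally
        close(server)
    end
end

# On the master:          println(read_first_line(8449))
# On the worker machine:  echo hello | ncat windows-master 8449
```

If the line arrives with ncat but not with telnet, the problem is in the pipe client rather than in Julia or HTCondor.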

Thanks! When I get a chance I’ll try this out.