On Julia v0.4 I was able to add remote workers on a Windows network using ssh (Bitvise).
I customised the startup command to connect using that specific ssh client.
However, it no longer works on Julia v0.5.
I do not expect anyone to help directly, but rather would like some advice on how to proceed to further diagnose where the problem lies.
I can start a remote Julia and execute a script.
From the Windows command terminal, the following works (here versioninfo.jl just contains versioninfo()):
C:\Users\Greg>sexec RemoteHost "C:\Program Files\Julia-0.5.0\bin\julia.exe" "C:\Users\Greg\versioninfo.jl"
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
System: NT (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Core2)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, core2)
C:\Users\Greg>
However, I can't seem to successfully start a remote worker.
Again from the Windows command line:
C:\Users\Greg>sexec RemoteHost "C:\Program Files\Julia-0.5.0\bin\Julia.exe" --worker vqtjHYxSVjEOVVF6
julia_worker:9009#192.168.1.112
Master process (id 1) could not connect within 60.0 seconds.
exiting.
It seems the worker is spawned and returns confirmation with its port and IP address, but the master never connects and the worker eventually times out.
Is there some kind of handshake not being completed successfully?
Are there any tips on how to further diagnose the problem?
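One way to narrow this down is to check, from the master machine, whether the port the worker announced is reachable at all. The following is a minimal sketch, not part of Julia's own worker code: parse_worker_line and can_reach are hypothetical helper names, and the code is written for current Julia (on 0.5, connect lives in Base, so the `using Sockets` line would be dropped).

```julia
using Sockets  # current Julia; on 0.5 `connect` is in Base, so drop this line

# Hypothetical helper: split the "julia_worker:PORT#HOST" line that
# `julia --worker` prints into a (host, port) tuple.
function parse_worker_line(line::AbstractString)
    m = match(r"julia_worker:(\d+)#(\S+)", line)
    m === nothing && error("not a worker announcement: $line")
    return (String(m.captures[2]), parse(Int, m.captures[1]))
end

# Hypothetical helper: try a plain TCP connection to host:port and report
# whether it succeeded. There is no timeout here, so a firewall that
# silently drops packets can make this call block for a long time.
function can_reach(host, port)
    try
        sock = connect(host, port)
        close(sock)
        return true
    catch err
        println("connect to $host:$port failed: ", err)
        return false
    end
end

# The announcement line from the session above.
host, port = parse_worker_line("julia_worker:9009#192.168.1.112")
```

Calling can_reach(host, port) on the master while the worker is still waiting shows whether a TCP connection gets through. A failure points at the firewall or the network rather than at Julia. The probe probably uses up the connection the worker is waiting for, so restart the worker afterwards.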
I think there were many issues. Remote Julia workers on Windows in a corporate environment present a plethora of pitfalls.
In particular:
The Windows firewall might be blocking the connection. (You can check in Control Panel: Allow an app or feature through Windows Firewall.) This should not affect you if you are connecting from a Linux master, though.
Also check that there are no proxy or network policy issues.
I also suggest using the Bitvise client to connect to each server the first time. On the first connection to a server, a host key/fingerprint is created via a user dialog (which won't be completed successfully if you connect a Julia worker in the background).
You might need to customise the command used to start the Julia workers. The default SSHManager builds its launch command in launch_on_machine(manager::SSHManager, ...) in managers.jl.
You might want to copy the default SSHManager and create a custom manager. You will probably need to experiment a little with the actual command to get authentication and so on right.
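As a starting point for such a custom manager, the launch command itself can be built as a Julia Cmd object. This is only a sketch: sexec_worker_cmd is a hypothetical helper name, and the argument layout simply mirrors the sexec command line shown earlier in this thread (Julia 0.5 style, with the cookie passed after --worker).

```julia
# Hypothetical helper: build the Bitvise `sexec` command that starts a
# worker on `host`. Interpolating into backticks keeps each value as ONE
# argument, so a path containing spaces needs no manual quoting.
function sexec_worker_cmd(host::AbstractString,
                          exename::AbstractString,
                          cookie::AbstractString)
    return `sexec $host $exename --worker $cookie`
end

# Values taken from the command line shown earlier in the thread.
cmd = sexec_worker_cmd("RemoteHost",
                       "C:\\Program Files\\Julia-0.5.0\\bin\\julia.exe",
                       "vqtjHYxSVjEOVVF6")

# Print the individual arguments to confirm the path stayed in one piece.
println(cmd.exec)
```

A custom manager would run a command like this from its launch method in place of the hard-coded ssh call. How authentication options are passed to sexec is not shown here and depends on the Bitvise setup.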
Greg, thanks! Exactly. Now I am thinking of using either a customized ssh manager or ClusterManagers with HTCondor instead of ssh. (However, condor.jl also requires customization to support Windows nodes.)
Finally I have installed HTCondor and I am trying to start remote workers through condor_submit. I have reconfigured condor.jl to run on Windows, but I have encountered an error with the way workers are registered.
Workers are started on remote computers with the following .cmd:
cd C:\Julia-0.5.2\bin
julia.exe --worker IFiJUqV4N1MuC9Uo | telnet windows-master 8449
But after the execution I can see the following in stderr:
ERROR: write: broken pipe (EPIPE)
in yieldto(::Task, ::ANY) at .\event.jl:136
in wait() at .\event.jl:169
in stream_wait(::Task) at .\stream.jl:44
in uv_write(::Base.PipeEndpoint, ::Ptr{UInt8}, ::UInt64) at .\stream.jl:820
in unsafe_write(::Base.PipeEndpoint, ::Ptr{UInt8}, ::UInt64) at .\stream.jl:830
in write(::Base.PipeEndpoint, ::Array{UInt8,1}) at .\io.jl:175
in print at .\strings\io.jl:70 [inlined]
in start_worker(::Base.PipeEndpoint, ::String) at .\multi.jl:1539
in process_options(::Base.JLOptions) at .\client.jl:218
in _start() at .\client.jl:321
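The trace shows start_worker failing while printing to a pipe, i.e. the worker's announcement line could not be written to the process on the other side of the `|`. To test the master side separately from telnet, a small loopback experiment can help. The sketch below assumes the master simply listens on a TCP port and reads one announcement line from whatever connects, which is what piping the output to telnet implies. It is not the actual condor.jl code, and it is written for current Julia (on 0.5 the socket functions are in Base, so `using Sockets` would be dropped).

```julia
using Sockets  # current Julia; on 0.5 these functions are in Base

# "Master" side: listen on the loopback interface. 8449 is only a hint;
# listenany picks the next free port if it is already taken.
port, server = listenany(ip"127.0.0.1", 8449)

# Accept one connection in the background and read one line from it.
received = Channel{String}(1)
@async begin
    conn = accept(server)
    put!(received, readline(conn))
    close(conn)
end

# "Worker" side: connect and send an announcement line, playing the role
# of `julia.exe --worker ... | telnet windows-master 8449`.
client = connect(ip"127.0.0.1", Int(port))
println(client, "julia_worker:9009#192.168.1.112")

line = take!(received)   # blocks until the listener has read the line
close(client)
close(server)
println("master received: ", line)
```

If this works between the real machines (with the loopback address replaced by the master's address) but the telnet pipeline does not, the problem is more likely in how the telnet client handles piped input than in Julia or the network.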
Greg,
finally my setup is working. It is a Windows environment with HTCondor as the job scheduler and Julia submitting requests to condor to start workers.
The default condor.jl is written to support a Linux environment, but I have changed the paths and notation and now it works with Windows.
I had problems using the telnet client to send the response from the workers to the master process. The default is:
cd C:\Julia-0.5.2\bin
julia.exe --worker IFiJUqV4N1MuC9Uo | telnet windows-master 8449