Julia cluster using docker

question

#1

Hi,
I asked this on Stack Overflow and was advised to post it here:

I am trying to connect to Docker containers using the default SSHManager. These containers only have a running sshd, with public key authentication, and Julia installed.

Here is my Dockerfile:

FROM rastasheep/ubuntu-sshd
RUN apt-get update && apt-get install -y julia
RUN mkdir -p /root/.ssh
ADD id_rsa.pub /root/.ssh/authorized_keys

I am running the container using:

sudo docker run -d -p 3333:22 -it --name julia-sshd julia-sshd

Then, on the host machine, I get the following error in the Julia REPL:

julia> import Base:SSHManager
julia> addprocs(["root@localhost:3333"])
stdin: is not a tty
Worker 2 terminated.
ERROR (unhandled task failure): EOFError: read end of file
Master process (id 1) could not connect within 60.0 seconds.
exiting.

I have tested that I can connect to the container via SSH without a password.

I have also tested that, from the Julia REPL, I can add a regular machine with Julia installed to the cluster, and it works fine.

But I cannot get these two things working together. Any help or suggestions will be appreciated.


#2

Hello @torce,

I recommend that you also deploy the Master in a Docker container. It makes your environment easily and fully reproducible.

I’m working on a way of deploying Workers in Docker containers on demand, i.e., the Master deployed in a container can deploy further DockerizedJuliaWorkers. It is similar to https://github.com/gsd-ufal/Infra.jl but assumes that Master and Workers run on the same host, which makes things simpler.

It is ongoing work, and I plan to finish it in the coming weeks. In a nutshell:

  1. You’ll need a simple DockerBackend and a wrapper to transparently run containers, set up SSH, call addprocs with all the low-level parameters, get containers runtime info, etc. (i.e., the DockerizedJuliaWorker.jl file):
  2. Read here how to build the Docker image (Dockerfile is included):
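
To illustrate the "low-level parameters" mentioned in step 1, here is a minimal sketch of what such a wrapper might pass to addprocs. It is not the DockerizedJuliaWorker.jl code. The helper names `machine_spec` and `container_worker_kwargs` are hypothetical, the Julia path `/usr/bin/julia` and the directory `/root` are assumptions about the container from the first post, and the sketch assumes Julia 1.x, where these functions live in the Distributed standard library. By default, addprocs starts the worker with the master's Julia path and working directory, which may not exist inside the container, so both are set explicitly here.

```julia
using Distributed  # standard library in Julia 1.x (assumption: not the 0.x setup from the first post)

# Hypothetical helper: build the "[user@]host[:port]" machine spec
# that addprocs accepts for SSH workers.
machine_spec(user::AbstractString, host::AbstractString, port::Integer) =
    string(user, "@", host, ":", port)

# Hypothetical helper: keyword arguments for a worker that lives in a container.
# exename and dir are assumed paths inside the container, not on the host.
function container_worker_kwargs(; exename = "/usr/bin/julia", dir = "/root")
    return (tunnel = true,       # reach the worker through an SSH tunnel
            exename = exename,   # Julia executable inside the container
            dir = dir)           # working directory inside the container
end

spec = machine_spec("root", "localhost", 3333)
kwargs = container_worker_kwargs()

# Uncomment once the container is running and reachable over SSH:
# pids = addprocs([spec]; kwargs...)
```

It is also worth checking that the Master and the container run the same Julia version, since a mismatch can make a worker exit during startup.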

Please tell me if you have any suggestions on how to improve it.

Best,

André Lage.


#3

Hi André Lage,
I have been looking at your project on GitHub. I have been thinking about such a setup ever since I started using Julia for my heavy compute workloads.

I am curious to get your thoughts on using your setup or Infra.jl on Docker Swarm.
Should one expect your scripts to work inside a Docker Swarm cluster?

Looking forward to getting your thoughts on this :slight_smile:

Best Mikkel Holm
Data Scientist.


#4

Hi Mikkel,

I would divide the whole problem into three parts:

  1. Infrastructure: VMs, bare metal servers, etc. for deploying containers.
  2. Docker Backend: code for deploying Docker containers and making them reach each other (complete graph virtual network topology).
  3. Dockerized Worker: create each Julia Worker in a container, which means calling addprocs with specific parameters and providing a set of handy management functions.

Docker Swarm Clusters on Azure addresses 1 and 2.

Infra.jl addresses 1, 2, and 3, but it works on Julia 0.4 (it is the result of Raphael Ribeiro’s bachelor thesis). Moreover, if you want to make Infra.jl scalable, you’ll need to decentralize its HTTP server, which provides single ports for configuring SSH servers for each deployed container (or think of a smarter solution for this). I don’t know how Azure addresses this, but they might have a good decentralized solution, as the demo video says that one can deploy hundreds of Docker containers in 5 minutes.

DockerizedJuliaWorker.jl addresses 1, 2, and 3, but it works only for a single machine (VM, bare metal server, etc.).

Now, answering your question “Should one expect your scripts to work inside a docker swarm cluster?”: Yes, provided that one implements a specific DockerBackend.jl by changing only a few lines of the existing one. For instance, to implement an AzureDockerBackend.jl which runs Docker containers, one should only change this line.
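
To make the idea of swapping backends concrete, here is a rough sketch of one way it could be structured. It is not the actual DockerBackend.jl API: every type and function name below is hypothetical, and the remote host value is an assumption. The only backend-specific piece is how the docker CLI is invoked; the command that starts the container is shared. The sketch only builds the command and does not run it.

```julia
# All names are hypothetical; this is a sketch, not the DockerBackend.jl code.
abstract type AbstractDockerBackend end

# Containers on the same machine as the Master.
struct LocalDockerBackend <: AbstractDockerBackend end

# Containers managed by a remote Docker endpoint (e.g. a cluster manager).
struct RemoteDockerBackend <: AbstractDockerBackend
    host::String  # value passed to `docker -H`
end

# The backend-specific part: how the docker CLI is invoked.
docker_cli(::LocalDockerBackend) = `docker`
docker_cli(b::RemoteDockerBackend) = `docker -H $(b.host)`

# Shared logic: build (but do not run) the command that starts a detached
# container and publishes its sshd port 22 on the given host port.
function run_container_cmd(b::AbstractDockerBackend, image::AbstractString,
                           name::AbstractString, ssh_port::Integer)
    mapping = string(ssh_port, ":22")
    return `$(docker_cli(b)) run -d -p $mapping --name $name $image`
end

cmd = run_container_cmd(LocalDockerBackend(), "julia-sshd", "julia-sshd", 3333)
println(cmd)
```

With this layout, supporting another environment would mean adding one struct and one `docker_cli` method, while the worker-side code that calls addprocs stays unchanged.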

I wrote a long answer to invite your and others' comments and/or further contributions on the problem :slight_smile:

Best,

André Lage.