Parallelize Julia on AWS

What is the best way to parallelize Julia on AWS? I’ve come across KissCluster, but I couldn’t tell how commonly this is used (also I’m getting bunch of errors when I try to create the cluster with Kiss). Any ideas or manual that you can point me to?

2 Likes

Just to be on the same page, have you read relevant part of manual and JuliaComputing blog post on the topic?

Yes, I have read those (I wasn’t aware of the blog post actually).
I have been using ClusterManager on the cluster provided by my institution. I am wondering how I can set up a similar environment on AWS. Do I create several instances and then use SSHManager to connect to them? Or should I create a cluster using something like Kiss or something else that is out there and submit jobs? I am just trying to understand whether there is a common way the Julia community approaches this question.

There are few approaches to parallelization of Julia on AWS:

  1. Run a single Julia across a cluster of computers using function available in Distributed package - this is achieved with Julia’s --machine-file command line argument
  2. Run independent sets of Julia processes across the cluster (this is achieved with the -p command line parameter). The number of processes is usually equal to the number of available cores. Use Distributed to manage the set. You need to use some external tool to coordinate execution.
  3. Run independent single Julia processes and use an external tool.
  4. In scenarios [1] & [2] use multi-threading within a single server instead of multi-processing or mix multi-threading with multi-processing.

The choice of a scenario is usually determined by the type of computational problem. If you run huge simulations (with plenty of memory requirements) you will choose [1], [2] or [4]. On the other hand
[3] gives you most flexibility with regard to scaling, followed by [2] and [4]. I have very often a situation when I see that my computations take too long to complete and want to chip-in 20 or 50 additional servers (I am very inpatient!).

Regardless of the scenario you need to choose how to orchestrate this process. On AWS you have the following options:

  1. Use CfnCluster by Amazon
  2. Use KissCluster - all the above scenarios are supported
  3. Manage the process yourself

CfnCluster (cloud formation cluster) allows you to setup a cluster management software similar to one available on supercomputers. Reasonable options include SGE and SLURM. Once installed you use ClusterManagers.jl to manage the computational process. The main problem I had with this option is that AWS cloud formation is slow - it takes around 15 minutes before you get the cluster.

Since I did not like waiting, I have developed KissCluster which is basically a bunch of simple bash scripts that orchestrates the distributed computation process.

Since Julia supports distributed computing in a very beautiful way, for the scenario [1] do-it-yourself might be reasonable option in some scenarios. All you need to do is to configure passwordless SSH between the nodes of your cluster and properly construct the machinefile contents.

Yet another issue is scaling the cluster and utilizing AWS EC2 Spot Fleet. AWS EC2 Spot instances allow you to buy computing power very cheaply - at around 0.01$ per vCPU core per hour. Somehow the best Spot prices are always at Ohio region and it is usually the best place to run such computations.
A Spot fleet allows you to easily control and increase the size of a spot cluster. In order for configure it with your simulation you need to create an AMI along with cloud-init. KissCluster will generate such script for you (or you can write it yourself). The CfnCluster supports a group of spot servers but the servers are not so loosely coupled as in case of KissCluster and hence the Spot Fleet mechanism is not supported.

If you describe the details of your computational setting I can help you to choose the scenario.
Should you have any questions let me know I will be glad to answer them.

2 Likes

And of course paste the errors that you have with KissCluster to this thread so I will explain what to do (or if it look like a bug make a GitHub issue)

Thank you so much for the detailed explanation!!
On the cluster that I have access to right now, I use ClusterManagers (and Distributed in v0.7) to parallelize computing values and filling a multidimensional array within an optimization algorithm. I have a master file which addprocs() after the job is submitted to the queue on the cluster. I use @distributed and pmap interchangably - right now one is not performing significantly better than the other. I am not completely sure which approach this corresponds to, but please let me know if I should provide more information.

OK, for the KissCluster - here is what I did:

  1. I clicked on the AWS Cloud Formation Script, and created the S3 bucket.
  2. I created the security group (to allow access from within itself).
  3. Created an Ubuntu Linux instance, using the following AMI. I picked general purpose t2.micro and didn’t configure the security groups (should I have used the security group I just created?) Then I launched the instance.
  4. I connected to the instance and typed:
wget -L https://github.com/pszufe/KissCluster/archive/0.0.5.zip
unzip 0.0.5.zip
cd KissCluster-0.0.5/

Then I typed
./kissc create --s3_bucket s3://kissc-data-1lp6vrwddvakh myc@us-east-2
but got the error

dpkg-query: no packages found matching jq

Next, I typed
sudo apt --yes install jq awscli
and tried to create the cluster once again. This time, I got the following error:

DynamoDB table kissc_clusters not found
Creating DynamoDB table kissc_clusters
Unable to locate credentials. You can configure credentials by running “aws configure”.

This is where I got stuck. I don’t know if this is a bug, but I’m guessing not. I tried to follow the instructions, but perhaps I missed something? Any help will be appreciated!!

Thanks!

1 Like

You are on good way.
KissCluster uses DynamoDB and S3 to store cluster meta-data and S3 to store computation state. Hence your EC2 instance needs to have the access rights to write to those places.

The recommended approach is to attach instance IAM Role profile to an EC2 instance.
Since you run KissCluster’s Cloud Formation script such IAM Role profile has been created for you. It is named similarly too: kissc-KisscInstanceProfile-1581BJVRWYZ1Y

Now what you need to do:

  1. Right click the EC2 instance
  2. Instance settings → Attach/Replace IAM Role
  3. Select the KissCluster role and click “Apply”

That’s all, let me know if you have any other problems.

Great, it worked! I now have a cluster with 10 nodes to try things out.
My next question then is how to submit this job. For example, let’s say I have a master.jl that includes a code like:

using ClusterManagers
myprocs = addprocs(10)
@everywhere include("setup.jl")
.....
rmprocs(myprocs)

should I just submit the job by:
./kissc submit --julia "master.jl" --folder samplecode/ --max_jobid 10 myc@us-east-2
(I edited the .profile file so that julia command works).

And finally, what do I need to do if I want to keep a log file to see the screen output (optimization iterations etc.)? This is how I submit jobs on a linux cluster
nohup julia master.jl > ./log_sample &
and I was wondering if there is an equivalent way of keeping a log file with this approach as well.

Thanks for all your help!

KissCluster can fully manage logging for you in the scenario [3] or scenario [4] where multi-threading is used without multiprocessing.
In those scenarios you can use KissCluster to run a distributed loop across any number of processes running across any number of servers (and KissCluster controls the maximum level of parallelism so jobs are being queued).
In those scenario:

  • KissCluster attaches logging to your standard output and standard error processes (see the picture at KissCluster’s documentation)
  • Provides automatic naming of job files
  • Stores the information on running jobs in DynamoDB tables
  • compresses the log files (containing standard output and standard error) once the processes completes
  • uploads the compressed logs to S3 creating a directory tree that makes it easy to store results of several different computations.

Since KissCluster is itself language independent (I have been using it with Julia, Python, Java and R) it does not know about the child processes that you can add via Distributed.addprocs() or -p or with ClusterManagers.jl. In all those scenarios KissCluster will be able to log only the standard output and standard error of the master process (it might be OK with you or not).

To submit a job with julia you can do:

./kissc submit --job_command "julia master.jl" --folder samplecode/ --max_jobid 10 myc@us-east-2

This will tell the cluster to run the master.jl ten times. Please note that the julia command must exist (e.g. must be reachable in the system PATH). Standard output a error of master.jl will be logged along the rules described above.

Now there is one important issue - in KissCluster submitting a job is decoupled from the actual execution. This design was taken by the fact the normally you wish to run workers on EC2 Spot instances and your computing capacity may come and go. This is different approach from Julia --machinefile or ClusterManagers.jl that assume that all servers will be available throughout all execution of computation.

When you run KissCluster a cloud-init script is executed. You can execute this script on any EC2 instance (that has IAM Role discribed in my previous post) and this instance will become worker in the cluster. Normally I create an AMI containing Julia along with all packages needed to run computations and than I launch this AMI as an EC Spot providing the above script as the cloud-init instance launch parameter.

I have tested this approach for EC2 Spot clusters up to 200 nodes (servers). Please note that for bigger clusters you need to adjust read- and write- through-output capacity of DynamoDB tables generated by KissCluster.

Should you have more questions just ask.

Thanks, for detailed explanation! @pszufe what is the roadmap for KissCluster, are you planning to maintain and update it on Julia 1.x ? I am also looking for a solution to schedule and run tasks (differential equations and computationally expensive optimization problems) on “on-demand” nodes (in future both cloud and on-prem but I want to start with cloud-based solution). I wanted to try kubernetes. Could you please explain in which cases KissCluster and other solutions mentioned here are more suitable and reasonable than container based orchestrators?

KissCluster is written in bash so it works in any Julia version (there is no direct Julia dependency and it works also with Python, Java and R).

KissCluster’s spot instance support is built on the fact that there is not central (master) node - instead there is just a DynamoDB table that holds the information about jobs and worker nodes.
Hence you can use KissCluster with in on-demand instances. An alternative for on-demand clusters is to use tools such as Slurm.

The difference is that KissCluster is very lightweight and is good for scenarios such as “run a cluster of 200 servers for 15 minutes and then terminate the cluster” while tools such as Slurm are much more advanced and offer functionality for maintaining and managing permanent long running clusters.

Hope that helps.

Thanks, I see.