Distributed Julia in the cloud

I know that JuliaPro does a good job of running Julia across a cluster on Azure. However, I have not seen an easy-to-follow guide for spinning up a cluster on AWS/GCP/Azure without the paid version.
I am really fed up with Spark on Scala/Python and the mess it creates at my current company. I would like to replace all of this with Julia running on Kubernetes, which we could deploy across multiple cloud environments.
Does anyone have experience launching such compute environments, especially for analytical workflows? I don't need GPUs at the moment, just the ability to read large parquet files (10 TB+) and process them in parallel across multiple workers.
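To make it concrete, here is the rough shape of what I'm after (a sketch only; the paths and the summarise step are placeholders, and I'm assuming Parquet.jl for reading):

```julia
# Rough sketch of the workflow I have in mind -- the paths and the
# aggregation are placeholders, not a working pipeline.
using Distributed
addprocs(8)  # or workers added by a cluster manager

@everywhere using Parquet, DataFrames

@everywhere function summarise(path)
    df = DataFrame(read_parquet(path))
    # ... per-file filtering/aggregation would go here ...
    return nrow(df)
end

files = readdir("/data/events"; join=true)  # the 10TB+ split across many files
results = pmap(summarise, files)            # fan out across workers
```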


Hi, are you considering only open-source solutions, or commercial products as well? If you are open to the latter, you may want to check out http://juliustech.co. It is a graph computing solution built on Julia that can automatically build complex data/analytical pipelines and distribute them to cloud environments, and it is well suited to large and complex analytical jobs.

I've not used it myself, but

https://github.com/banyan-team/banyan-julia

is designed for this scenario, although there is a flat fee involved:

https://www.banyancomputing.com/custom-scripting/

@calebwin can provide more details.

@lawless-m Thanks for the mention, Matt!

@niczky12 Hi Bence - a couple of potential solutions for your Parquet data-analysis use case using Banyan:

  1. Banyan Custom Scripting lets you run Julia scripts with manual parallelism on an auto-scaling cluster in your Virtual Private Cloud
  2. BanyanDataFrames.jl is similar, but it automatically parallelizes DataFrames.jl computation using the same API (see the sketch below)
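To give a feel for option 2, here is ordinary single-machine DataFrames.jl code of the kind BanyanDataFrames.jl is meant to parallelize (the file name and the :user_id/:amount columns are made up for illustration):

```julia
using DataFrames, Parquet

# Ordinary single-machine DataFrames.jl code; file and column names
# are placeholders.
df = DataFrame(read_parquet("events.parquet"))
by_user = combine(groupby(df, :user_id), :amount => sum => :total)
sort!(by_user, :total; rev=true)
```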

Other options to consider: JuliaHub, Julius (which looks interesting), or manually setting up an EC2 cluster and installing/managing Julia yourself.
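If you go the manual EC2 route, Julia's built-in SSH cluster manager may be all you need. A minimal sketch, with placeholder hostnames:

```julia
using Distributed

# Julia's built-in SSH cluster manager: each tuple is (host, nworkers).
# The hostnames are placeholders for your own EC2 instances, which need
# Julia installed and SSH key access set up.
addprocs([("ubuntu@ec2-host-1.compute.amazonaws.com", 4),
          ("ubuntu@ec2-host-2.compute.amazonaws.com", 4)];
         tunnel=true, exename="/usr/local/bin/julia")

@everywhere println("worker $(myid()) on $(gethostname())")
```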

Some advantages of Banyan:

  1. One nice thing about Banyan is that it's compatible with existing APIs (DataFrames.jl and MPI.jl), so if you decide to stop using Banyan, you don't have to migrate your workload to a different platform. You can pretty much take your code as-is, manually set up a cluster on premises or in the cloud, and run your code there (see the MPI.jl sketch just after this list).
  2. Other advantages compared to other services include the ability to use Banyan from any environment (not just VS Code), cloud computing sessions that start in under 10 seconds, and easier access to Amazon S3.
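As a concrete illustration of that portability, a generic MPI.jl hello-world like the one below (nothing Banyan-specific) runs on any cluster with an MPI installation, launched with mpiexec or MPI.jl's mpiexecjl wrapper:

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

# Each rank contributes a value; Allreduce sums it across all ranks.
total = MPI.Allreduce(rank, +, comm)
println("rank $rank of $nranks sees total = $total")

MPI.Finalize()
```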

Some limitations: it's AWS-only, and it's fairly new (even compared to Julia itself, which is relatively "new"). We're currently benchmarking BanyanDataFrames.jl to compare it with other solutions, which don't have the same automatic CPU cache optimizations that Banyan has.

There are also getting-started steps, and we hope to have some video tutorials soon, but please let me know if there's anything immediate you'd like to know about any of the Banyan Julia packages :slight_smile:


Invenia uses: https://github.com/JuliaCloud/AWSClusterManagers.jl
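The basic usage pattern is roughly this (a minimal sketch; it assumes an AWS Batch compute environment and job definition are already configured):

```julia
using Distributed, AWSClusterManagers

# Spawn 4 Julia workers as AWS Batch jobs; assumes the Batch compute
# environment and job definition already exist in your account.
addprocs(AWSBatchManager(4))

@everywhere println("hello from worker $(myid())")
```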

That looks quite cool! Do you know approximately how long the cold job startup time is?

IIRC: long, like 5-20 minutes?
It depends on a number of things.

Yeah, I've been playing around with K8sClusterManagers.jl and it works well with Kubernetes: https://github.com/beacon-biosignals/K8sClusterManagers.jl
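The basic pattern is just the following (a sketch; it assumes you're running from a pod inside the cluster with a service account that can create pods):

```julia
using Distributed, K8sClusterManagers

# Launch 4 workers, each in its own pod in the current namespace;
# assumes this is run from inside the cluster with suitable permissions.
addprocs(K8sClusterManager(4))

@everywhere println("worker $(myid()) is running in its own pod")
```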


JuliaHub.com makes it very easy to spin up a cluster on AWS and run a distributed job on it (Azure support is coming later in the year); it also makes running GPU jobs, or even distributed GPU jobs, easy. You can use it as an individual on juliahub.com immediately, although you'll need to put down a credit card if you want to try the compute. We used to offer $25 of free compute to try it out, but the crypto miners ruined that :cry:. For a company, you can get company.juliahub.com as an SSO-integrated entry point where you can collaborate on private projects, share datasets, manage private registries and packages, and so on. Feel free to ping us on JuliaLang Slack if you have any issues or questions; there's a #juliahub-usage channel.


I should probably add that JuliaHub is SOC 2 (Type 1) compliant, since a lot of companies care about security certification for the services they use for compute and data management.

What exactly are your pain points with Spark? At first glance, it looks like a perfect solution for analytics on 10 TB+ of parquet files, and way more suitable than a custom Kubernetes-based system where you'd need to implement coordination of containers, shuffling, IO, etc. yourself.


Thanks, everyone, for the quick and insightful replies. This was my first post here, and I didn't expect so many eyes on it!


I do agree that Spark sounds like the tool for jobs like this, except that I used BigQuery, which did 90% of what Spark could do a lot faster and cheaper, with a lot less code.
I find Spark hard to test, and the fact that jobs can die after an hour of processing because I didn't pick the right workers is super annoying. Then, if you want to dig into why things failed, you're met with a pile of different languages flying around. I mostly use PySpark, so getting Java errors is not a nice experience at all. That's why I was wondering whether Julia had a competing solution that's purely Julia. That would solve my problem of hating Java and would also avoid the vendor lock-in of BigQuery, for which I'd need a GCP licence.
Maybe I'm just using Spark wrong, but I do dream of a world where Julia could easily be used for things like this.

Banyan looks nice, but I doubt I'd get away with those permissions. :smile:

If BigQuery is good enough for your use case and you only worry about vendor lock-in, try another SQL engine. For example, AWS has managed Athena, which is based on open-source Presto/Trino, so you can both start quickly and migrate easily if you run into issues with AWS.

> I find Spark hard to test, and the fact that jobs can die after an hour of processing because I didn't pick the right workers is super annoying.

Unfortunately, the need to select the right resources is almost unavoidable in big-data applications. BigQuery is somewhat special here because it provides virtually unlimited resources, but that comes with the risk of a virtually unlimited bill at the end of the month. All the other systems I've worked with, including Spark, HBase, Vertica, Redshift, Presto and others, require you to plan resources in advance. Kubernetes is no exception, even if you manage to implement a general-purpose data-processing workflow on it.

For Azure, you can check out AzManagers.jl or AzureClusterlessHPC.jl. Note that both are Azure-specific, and neither is Kubernetes-specific, so it's not a direct answer to your question, but I thought it might be interesting information nonetheless.

Thanks for pointing that out; we definitely don't need all of those permissions :smile:

But it does sound to me like Spark is your ideal option - have you seen @dfdx’s Spark.jl?

I should admit Spark.jl hasn't seen much development recently, and it definitely doesn't solve the PySpark issues mentioned here :see_no_evil:

So is this Julia language thing good at cryptocurrency mining then? :wink: