Cloud services for intermittent numerical computing in Julia

Some of my research projects work like this: prototype a solution in Julia on a laptop/desktop, followed by a computational phase (e.g., MCMC running on a server for a week), then analysis of the results again on the laptop/desktop; lather, rinse, repeat until satisfied.

In the past we built servers and ran them in our offices, but the research grants I am participating in increasingly prohibit buying hardware without a (very) good justification, yet allow spending on services.

Is there a cloud solution which would support this workflow, i.e., power up an instance for a period of time (possibly unknown in advance), get the data and code from a private git repo, upload the results when done, then power down? Ideally it would also allow interactive SSH access of some kind in the meantime.

How do people manage their Julia projects in this context? Do you use some kind of container?

7 Likes

JuliaHub is designed for this kind of usage. While most of its usage is enterprise, you can put in a credit card and use it as an easy-to-use, scalable CPU/GPU machine. And we know it works well with Julia’s distributed compute and GPUs, because that’s how it’s generally used.

I’m not sure about the SSH access; someone else can comment on that.

If there’s anything you need to know to make the services payment work out from the university side, feel free to get in contact with details and we can figure something out. It is built on AWS, so I would assume such credits could be used for part of it, since the JuliaHub cost is JuliaHub + AWS. But I haven’t looked into that.

7 Likes

Please write to me at sushil.kumar@juliahub.com and I can help clarify your questions.

1 Like

AWS works pretty well for us.

If you primarily want to use one instance at a time, you can use EC2. This gives you a lot of control over a single (virtual) machine – you can stop and restart the same instance, saving any state. And you can change the size of the machine every time you restart, so e.g. you can do most of your work on a small, cheap machine, then scale it up for an experiment, then scale it back down. And the standard way to access these instances is via SSH. You generally wouldn’t use a container here.
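To give a flavor, here is a minimal sketch of scripting that resize cycle from Julia by shelling out to the AWS CLI (the instance id and instance types are placeholders, and it assumes the `aws` CLI is installed and configured):

```julia
# Hypothetical helper: stop an EC2 instance, change its size, and start it again,
# shelling out to the AWS CLI from Julia.
const INSTANCE_ID = "i-0123456789abcdef0"   # placeholder instance id

function resize_and_start(instance_id, instance_type)
    run(`aws ec2 stop-instances --instance-ids $instance_id`)
    run(`aws ec2 wait instance-stopped --instance-ids $instance_id`)
    run(`aws ec2 modify-instance-attribute --instance-id $instance_id --instance-type Value=$instance_type`)
    run(`aws ec2 start-instances --instance-ids $instance_id`)
    run(`aws ec2 wait instance-running --instance-ids $instance_id`)
end

# Scale up for the long run, scale back down afterwards (types are examples only):
resize_and_start(INSTANCE_ID, "c5.4xlarge")
# ... run the experiment over SSH ...
resize_and_start(INSTANCE_ID, "t3.medium")
```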

If you want to run a lot of instances at once (e.g. to run many copies of an experiment with slightly different parameters), you’d use Fargate.

A Fargate workflow might look like this:

  • Develop code locally and test it on a small amount of data
  • Build a Docker container that contains the current version of the code
  • Upload the Docker container to AWS ECR
  • Run the tasks using Fargate

It’s a fair bit of work to set up, but it scales very well.
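As a concrete illustration of the last two steps, the image’s entry point can be an ordinary Julia script; here is a hypothetical sketch (the parameter names, bucket, and the computation itself are placeholders):

```julia
# entrypoint.jl -- hypothetical entry point baked into the Docker image.
# Each Fargate task gets its parameters through environment variables
# (set via container overrides) and writes its result to S3 when done.
using Serialization, Random

run_experiment(seed) = (Random.seed!(seed); sum(randn(10^6)))  # stand-in for the real computation

seed   = parse(Int, get(ENV, "EXPERIMENT_SEED", "1"))          # hypothetical parameter name
outkey = get(ENV, "RESULT_KEY", "results/run-$seed.jls")       # hypothetical S3 key

result = run_experiment(seed)
serialize("/tmp/result.jls", result)

# Upload via the AWS CLI installed in the image (bucket name is a placeholder).
run(`aws s3 cp /tmp/result.jls s3://my-results-bucket/$outkey`)
```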

4 Likes

Thanks — the workflow you describe seems ideal for our use case. That said, with Amazon’s rates, just building a dedicated machine would pay for itself in 10–20 weeks, so I will try to see if we can do that in-house.

Thanks. I am asking here instead as I think the topic is of general interest. Given that JuliaHub’s hourly rates are about double those of AWS, can you describe what it adds? I am sure it provides useful functionality for some people, but I just need the bare OS + a working Julia instance.

3 Likes

Depending on the resilience of your program to interruption, you can save a huge amount by using Fargate Spot. I moved a Julia workload from x86 Fargate to ARM Spot Fargate, and it cut our costs by ~70%.
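Resilience to interruption mostly comes down to checkpointing often enough that losing an instance is cheap; a minimal sketch in plain Julia (the path, interval, and per-iteration work are arbitrary stand-ins):

```julia
# Minimal checkpointing sketch so a Spot interruption only loses the last
# bit of work: resume from the latest checkpoint if one exists.
using Serialization

const CKPT = "/tmp/checkpoint.jls"                              # arbitrary path

function resumable_run(n_iters = 10_000)
    state = isfile(CKPT) ? deserialize(CKPT) : (iter = 0, acc = 0.0)
    for i in (state.iter + 1):n_iters
        state = (iter = i, acc = state.acc + sum(randn(10^5)))  # stand-in for one unit of work
        i % 100 == 0 && serialize(CKPT, state)                  # persist progress periodically
    end
    return state.acc
end
```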

3 Likes

So at a very basic level, JuliaHub abstracts all of AWS for you. So you don’t have to worry about managing the disk images, getting data to the VMs, starting a cluster and getting the nodes talking to each other, stopping them exactly as soon as the computation finishes – all of which is possible if you already have the tools and know-how, but is non-trivial for a lot of scientists. The basic idea is that you just give us Julia code, and we do everything beyond that to run it for you.
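To make that concrete, the code you hand over is just ordinary distributed Julia; a generic sketch (plain Distributed.jl, not JuliaHub-specific syntax) might look like:

```julia
# Generic distributed sketch: whatever runs with Distributed.jl locally is the
# kind of thing you submit; the platform takes care of provisioning the workers.
using Distributed
addprocs(4)                                     # locally; on a cluster the nodes are already attached

@everywhere simulate(seed) = sum(randn(10^6))   # stand-in for the real per-seed computation

results = pmap(simulate, 1:100)                 # fan the seeds out over the available workers
```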

So if you run a job on 20 nodes for an hour, take a coffee break, and then forget to turn them off overnight – protecting against that is what you pay for on JuliaHub :). If you have tooling to do all of that already, then it’s probably not that useful. Similarly, if you need a single large machine that you keep on all day, without any need for traceability, it’s probably not for you.

There is other, more advanced stuff – reproducibility, content-hashed datasets, authenticated private registries, project sharing, etc. – as the more enterprisey features.

Hope this helps. Let me know if there are any follow up questions.

Regards

Avik

6 Likes

What’s your experience with ARM, compared to x86? Is there a significant performance difference? A colleague also suggested Oracle Ampere, which is ARM-based, but I have never used ARM before.

That looks like a valuable feature; I find that I can manage these things if I have to, but they involve a lot of time and frustration.

Having a free tier also allows me to check out various options; I will report back here when I have finished. In the meantime, additional suggestions are welcome from everyone. I think that the days of maintaining a server on site are gone and I need to adapt to the new reality :wink:

2 Likes

It depends a lot on the specific chips. There are pretty good aarch64 chips, just as there are pretty good x86_64 chips. Perhaps “pretty good x86_64 chips” are more common than aarch64 ones just because Intel had a head start.

For what it’s worth, a few months ago I collected some Julia benchmarks on AmpereOne A192-32X, which may give you an idea of the performance to expect on that platform.

1 Like

I’m not a good test case because lots of my process is IO bound, so CPU performance doesn’t play a huge role. That said, it didn’t tank my performance unexpectedly or anything like that.

1 Like

Yes, AWS is definitely expensive if you’re using any amount of capacity regularly. You can mitigate this somewhat by using Reserved Instances, but you’ll definitely still be paying a lot.

We switched from AMD to ARM a few years ago and saw ±5% differences on our workloads, but the ~30% cost reduction made it worth it.

1 Like