I’ll be working on a Hadoop cluster (hdfs+yarn). I’m checking the best options to integrate Julia in this environment to handle big data distributed processing. Any ideas on the best options?
Spark.jl? Elly.jl? Mount hdfs and use Distributed stdlib? Ignore all hadoop software and use just Julia with Distributed stdlib? MPI.jl ?
One thing I had in mind before this opportunity is that I could just use plain Julia and use a shared NFS mount to ask workers to load chunks of data. But since Hadoop is all about solving this kind of use case, I’m trying to find out which good parts I should incorporate in my workflow using Julia.
I think the better questions to ask first are really about your workload and desires, which will best determine what approach would work for you:
- How much data are we talking?
- What will you be doing with it?
- How often does it need to be processed?
- Do you require extreme resiliency to server/process failure? Or can you just restart the job if it fails?
- What sort of hardware are you running on (HDD vs SSD, network speed, RAM, CPU)?
The dataset is about 21TB in size. Most of the time the user will ask to process a 17GB chunk of it. It will increase 17GB daily.
All the processing is already implemented in Julia and I rely a lot on ForwardDiff. The results are always reports on the data, so the input data will not be changed by the Juila process.
The service will be up all the time, exposed as a HTTP REST API. Users can ask to process different chunks of data with different processing parameters. I’m expecting about 100-500 analysis requests a day, but I can’t tell how much this will increase. But I would like to answer the user request as fast as possible.
I don’t mind if the job fails. I can just restart the job.
The hardware is old but gold. 10-year-old Xeon servers with 1TB of HD, 200GB of RAM, 12CPUs on each node, running at a regular local network. CPUs are really slow. The servers are virtualized. I still don’t know how many nodes I’ll be getting, but I guess I’ll start with 5.
I’m thinking it is better to use all CPUs all the time, since the CPU is really slow, and the CPU intensive part can be set to run in parallel.
Since I’m just getting to know this technology (Hadoop stack), I would like to know how people are using Julia in this environment, if there is any synergy between them.
Another thing that is being discussed is to abandon Julia and migrate to Hadoop / Spark.
That amount of data is definitely large enough to warrant considering Hadoop or Spark. Given that I don’t know much about Hadoop or Spark, however, the only negative thing I can say about them is that I’ve heard that they’re a pain to setup and maintain, and they aren’t super performant. But if you need great fault tolerance, Hadoop is apparently something of a gold standard.
Given that you don’t care about fault tolerance as much, the amount of data you’ll process at one time is not very large, and your data being mostly immutable (no in-place changes by Julia), JuliaDB might be something to consider. There are of course some rough edges to it right now, namely the tendency for it to crash and not recover on large, distributed datasets (which I’m working on fixing in Dagger), and its large memory usage (which someone is already planning on tackling in the next few weeks).
However, if you didn’t want to use JuliaDB/Dagger, having all of your processing already done in Julia means that you have a high cost of switching to another system and language (namely something that runs on the JVM), and unless you have extensive experience with Hadoop/Spark, I don’t see a good rationale for switching your processing code over to use them. You’re probably better off in the short- and long-term with making Julia’s parallel capabilities work for you, in my opinion.
I would recommend starting with setting up a small-scale proof-of-concept of your desired pipeline using JuliaDB or similar, and benchmarking it with example queries to see if it handles the load you need. If you find it does not handle the load adequately, then we can try to work with you to see if we can tweak JuliaDB et. al to better handle your usecase. If that doesn’t pan out, then it’s safe to consider investing time in setting up a Hadoop/Spark stack proof-of-concept, and see if that’s more suitable for your usecase.