Developing a Beginner's Roadmap to Learn Julia High Performance Computing for Data Science

Hi all :wave:

Hope you are doing well! In alignment with both my professional and personal interests, I am slowly starting to turn my attention to the domain of High Performance Computing for the purposes of:

  • Designing high throughput data pipelines in Julia
  • Working with multi-threaded processes
  • Handling the processing of larger than memory datasets
  • Efficient applications of algorithms to time series data (on and offline)
  • Maximizing use of hardware/software capabilities

If this sounds somewhat vague and hand-wavy, that is because it totally is. I am very much a beginner in this area and am trying to develop a beginner’s roadmap for learning the above material, assuming those are even the areas I should be thinking about in the realm of High Performance Computing for Data Science.

I have been working through Performance slowly, learning more about binary data storage formats like HDF5 and Apache Arrow (my current favorite), interfacing with relational databases from Julia, and picking up tips and tricks here and there. However, as the title of this thread says, I made this post to be somewhat of a nexus point to collate resources one could use to scope out a beginner’s roadmap for learning about High Performance Computing applied to aspects of data science.

For example, LoopVectorization.jl looks amazing, but I am not sure if it would be applicable to data science tooling. Parallel computing with Distributed looks good, but I am not sure how to use it effectively in conjunction with awesome tools like DataFrames.jl and Arrow.jl. And the list continues.

So would anyone be willing to help comment on approaches to building a beginner’s roadmap, sharing what you wished you knew up front about High Performance Computing before diving in, and resources/strategies on learning best practices?

Thank you kindly and have a wonderful day!

Yours,

~ tcp :deciduous_tree:


Additional Background

What do I mean by Roadmap?

What I mean by “roadmap” is a step by step learning approach to take in surmounting this problem. It could be along the lines of:

  1. Read the Julia Docs
  2. Check out packages X, Y, Z
  3. Go practice things at

Right now, I am trying to map out, to paraphrase a quote, what I know, what I don’t know, and what I don’t know that I don’t know.

What is meant by Data Science?

Honestly, I have left this intentionally vague as data science means SO much across the tech world. If you have experience in working in data science in any way with a focus on High performance computing, please share your thoughts here.

What is my background?

I am not a computer scientist! I come from the world of biomedical engineering, healthcare informatics, and academia - so my responses will be colored by that lack of knowledge. :slight_smile:

  1. Learn MPI
  2. ???
  3. Profit

Only partly joking :wink: – you really can’t go wrong with learning MPI.

While for very very large problems you may want to go with some sort of hierarchical parallelism approach with a shared memory model within nodes and a distributed memory model between nodes, it’s really the shared part that is optional. Latency between nodes is killer, and shared memory generally really doesn’t scale well beyond a single node. And if you want granularity in deciding exactly what information critically needs to be shared where, MPI is really the de facto way to do that.

Some other more general advice IMHO

  • Write like you’re writing in C. Do things manually as much as possible.
  • Amdahl’s law: know it, love it
  • Do use vector registers, if you’re on CPU. LoopVectorization is awesome for that. If your problem isn’t amenable to that, make it amenable.
  • While I’ve tried to avoid having to use GPUs, that’s probably a losing proposition long term. Especially given that “Xeon Phi” and such has not really taken off, and systems like Cori may be the last serious supercomputers to use them. Most (I suspect all) of the Exascale systems in development are getting a substantial majority of their flops from GPUs. Which is another reason Julia is great because I’d so much rather use CUDA.jl than raw CUDA. (That said, I still think there will be a niche for CPU-only compute for a long time yet – quantum chemistry anyone?)

Hi,
Learning MPI was difficult for me because of three main reasons. First of all, the online resources for learning MPI were mostly outdated or not that thorough. Second, it was hard to find any resources that detailed how I could easily build or access my own cluster. And finally, the cheapest MPI book at the time of my graduate studies was a whopping 60 dollars - a hefty price for a graduate student to pay. Given how important parallel programming is in our day and time, I feel it is equally important for people to have access to better information about one of the fundamental interfaces for writing parallel applications.


Hey @brenhinkeller ! Thanks for the thoughtful advice and tips! As these are terms that I am largely unfamiliar with, apologies if my follow-ups here are naive:

  1. Regarding MPI, I looked into it some and am curious how I could link it to a data science type of workflow. Do you have a practical use case in mind? Perhaps an example related to quickly reading a file, rapid batch processing, etc.? I always like to tie somewhat abstract knowledge like this to something that I know or that is more tangible to me.

  2. I like the C idea a lot! I will start thinking about that as I am well past just “being comfortable” writing Julia code. I should start thinking about how to make my basic writing of Julia code perhaps better - mostly held off from it due to how early optimization can stifle creativity or production.

  3. Why should I know Amdahl’s Law and love it? I read the wiki page some and I see that it is talking about threaded processes - where does this come to bear in day to day decision making about high performance systems?

  4. Pardon my ignorance but I didn’t know about vector registers! Time to do some investigating :male_detective:

  5. Right now, I am not too worried about GPUs, but someday I will approach CUDA with CUDA.jl!

Really exciting thoughts and great directions so far! Thank you!!!

Here’s a quote from Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann:

”Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor for these applications—bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.

A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. ”

Integration with tools like databases, Apache Spark, etc. is very relevant in this field. On the other hand, this is quite far from solving differential equations.

Ah, so it’s basically applicable to anything you’d ever want to do in parallel in a distributed-memory context.

Reading/writing a big file? Use MPI_File_read_at and MPI_File_write_at to read and write in parallel (aside: MPI.jl generally follows the C API for MPI). Note of course that you have to know a fair amount about the structure of the file for this to work well, and you’ll want to be on a fast parallel filesystem (e.g. Lustre) to not just get killed by IO bandwidth, but people do it with big HDF5 files on clusters all the time. Or, say, batch processing – each MPI task can just read a different file, then you do whatever you have to with that information. If you have more files than tasks, you can have one or more tasks work as a scheduler to coordinate the others.
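The batch-processing case (each task takes a different file) can be sketched without any MPI library at all. What follows is only an illustrative sketch in plain Python, not MPI.jl code: `rank` and `size` stand in for the values an MPI program would get from its communicator, and `files_for_rank` and the file names are hypothetical, made up for this example.

```python
def files_for_rank(files, rank, size):
    """Return the files that task `rank` (0-based) out of `size` tasks should process.

    Round-robin assignment: task 0 gets files 0, size, 2*size, ... and so on.
    """
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    return files[rank::size]


# Hypothetical file names, for illustration only.
files = [f"part_{i:03d}.h5" for i in range(10)]
size = 4  # assumed number of tasks

# In a real MPI run each task would only compute its own share;
# here we loop over all ranks just to show the split.
for rank in range(size):
    print(rank, files_for_rank(files, rank, size))
```

A static split like this is probably only reasonable when the files take roughly the same time to process. When they do not, the scheduler-task approach mentioned above tends to balance the load better.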

Now those examples by themselves are almost too easy to count as HPC, but you could do it. If you’re going to do some sort of complicated calculation on that data though, then that can definitely count (e.g. Celeste.jl). Or say on top of that there were some degree of inter-process communication that you needed to coordinate (so, say, you needed to extract some different information from each file, share it with some neighboring tasks, and do something more complicated with that information), then you’re really going to be glad to be in an HPC environment with InfiniBand interconnects or whatnot, and will be glad to be managing message passing manually so that you only have to send the bare minimum, because even with that fancy interconnect, latency between nodes is huge compared to within the node.

Amdahl’s law is the hard limit on how much you can scale. If there is any part of your problem that is serial, Amdahl’s law tells you just how quickly that will come to dominate your runtime. Consequently, it can often tell you a priori whether you should even bother trying to scale a given code to 10 cores or 100 or 100,000.
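To make that concrete, here is a small sketch of the usual statement of Amdahl’s law, speedup = 1 / (s + (1 - s) / n), where s is the serial fraction of the runtime and n is the number of workers. The function name and the 5% serial fraction are assumptions chosen for illustration, not numbers from any real code.

```python
def amdahl_speedup(serial_fraction, n_workers):
    """Best-case speedup predicted by Amdahl's law.

    serial_fraction: share of the runtime that cannot be parallelized (0.0 to 1.0)
    n_workers: number of cores/tasks the parallel part is spread across
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)


# Assumed example: 5% of the runtime is serial.
for n in (10, 100, 100_000):
    print(n, round(amdahl_speedup(0.05, n), 2))
```

With 5% serial work the speedup can never exceed 1 / 0.05 = 20x, no matter how many cores are added, so most of the gain is already used up well before 100,000 cores.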


Now of course, how applicable any of this is to data science specifically depends on what you mean by data science, but working with massive amounts of data is certainly common enough in HPC. To some extent it maybe doesn’t matter what the data is, because ultimately everything on a computer is just numbers at a low level, and that low level is generally exactly where you want to be working to be efficient enough for it to be worth expending HPC resources.

Big memory is eating big data.

Basically, given you can get a 2TB RAM machine on AWS, chances are that for batch workloads you are almost always better off using a bigger machine rather than multiple small machines.

The reason is simple: in terms of speed, transferring data over the network is the slowest, followed by transferring data from disk to RAM.

If you can, do as much work as you can in RAM and do as little as you can over networks.

This should cover 99.99% of workloads. That’s why tools like {disk.frame}, vaex, and Dask exist.

If you go beyond that, you might want to look into Spark and, if it’s more specialized, then MPI.

What’s the deal with Spark? It works over a node and it compares itself to Hadoop. Hadoop was a tool that let you use commodity hardware over a potentially faulty network. It was cool when you could get lots of cheap computers and string them together to process big data.

But… people have shown that it’s too slow! So Spark came in with pretty much the same model as Hadoop, except it does things in RAM rather than on disk as was the default for Hadoop. So on their front page Spark looks awesome, cos it’s like 10x faster than Hadoop! But what about against other tools? IF the data fits into RAM, Spark’s config nightmare and complexity is a no no vs something simpler like data.table.

Did I mention big memory is eating big data? If your data can sit on the hard drive of one computer, chances are you can get awesome data manipulation performance out of vaex or {disk.frame} on that one computer, so you don’t have to worry about networking issues!

Big rant over.


Yeah, I agree with that too. Even though memory per core is maybe not increasing the way it used to year-over-year (and has maybe even decreased in some cases?), total memory and memory per node are both still growing quite well.

It’s actually quite hard to have a dataset that does not all fit in memory in HPC land when you have clusters with petabytes of RAM.

Which I suppose also fits in with the theme that you generally don’t need HPC for “data science” in the ordinary use of the term. But that doesn’t mean you can’t do lots of things that arguably qualify as data science in the HPC realm – just more things where there is also some modelling or simulation or something otherwise computationally expensive you need to do with all that data.

@sjain123nahta regarding your comment about finding resources to build your own cluster, you can build a cluster of Raspberry Pis to learn MPI.

Or why not use a cloud provider? AWS/Microsoft/Google have free tiers to help with prototyping and learning. AWS ParallelCluster is capable of running on free tier instances.

What stops you from learning MPI on your laptop? You can either program using shared memory segments, or there are plenty of Vagrant recipes out there that build a cluster with VMs on your laptop.

Here is Wee Archie

A FOSDEM presentation on Raspberry Pi clustering


I checked what Google cloud offers, and the largest memory available seems to be in m2-ultramem-416, which has 11 776 GB of it, and 416 virtual CPUs, for a standard hourly rate of $84.371 right now.


GCP VMs are connected to disks using internal networking, so by definition the internal network is faster than disk access. I’m not that familiar with AWS, and I suppose it works differently?

I’ve seen this mentioned in tutorials, but I only found this as a direct link. Google Cloud Platform Blog: A look inside Google’s Data Center Networks

Like how big is ur data? Can u not just hire a big machine for a few hours if u have such big data needs? Business will pay for that

I don’t currently have such needs. My point is that handling “big data”, if it’s counted in the roadmap, has different needs than a CPU optimized program. Access to VMs with lots of RAM may change this, but they’re expensive as well.

I also learned of a term “small big data”, which would cover for instance being able to handle 100 GB of data on a laptop. Naturally it wouldn’t be as fast as with more RAM.
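Handling data that is bigger than RAM on a single machine usually comes down to streaming it in chunks and keeping only small running totals in memory. Here is a minimal sketch of that idea in plain Python. It is not how any particular tool is implemented, and the helper names, the column name, and the tiny in-memory "file" are all made up for illustration.

```python
import csv
import io
from itertools import islice


def chunked_rows(reader, chunk_size):
    """Yield lists of at most `chunk_size` rows from a csv reader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def streaming_mean(lines, column, chunk_size=2):
    """Mean of `column`, holding only one chunk of rows in memory at a time."""
    reader = csv.DictReader(lines)
    total, count = 0.0, 0
    for chunk in chunked_rows(reader, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Tiny in-memory stand-in for a large file on disk (made-up values).
demo = io.StringIO("t,value\n0,1.0\n1,2.0\n2,3.0\n3,4.0\n4,5.0\n")
print(streaming_mean(demo, "value"))
```

In practice `lines` would be an open file handle and `chunk_size` would be far larger. The point is only that memory use depends on the chunk size, not on the size of the file.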

My tips

1.)
Find a data project that interests you and has a community benefit.

  • Wikidata - data cleaning / processing ( JSON Dump exists )
  • OpenStreetMap ( export to csv, or geo database )
  • Biodata … ( in your area )

2.)
If you have a minimum of experience with the previous task, try to make it harder.

  • process it on a low power/low memory machine ( Raspberry Pi, … ) with a sharp bottleneck.
  • try to process faster on several parallel small machines.

3.)
Why?
If you can scale the hardware down, it will be easier to scale it up !

You need to feel the data in your gut !

You need to get the basic data best practices right,

  • filter the data at the very beginning of the data flow
  • you need to figure out how to parallelize the data flow ( e.g. create data cubes )

My favorite method:

  • split the data into lots of small packets ( data frames, cells )
  • and process them in parallel with the “gnu parallel” command.
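As a rough illustration of the "filter early, split into packets, process in parallel" pattern above, here is a small Python sketch. It uses a thread pool purely to keep the example self-contained, whereas GNU parallel launches separate processes. The function names and the toy per-packet work (summing numbers) are assumptions made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_packets(records, packet_size):
    """Cut a list into consecutive packets of at most `packet_size` items."""
    return [records[i:i + packet_size] for i in range(0, len(records), packet_size)]


def process_packet(packet):
    """Placeholder per-packet work: sum the values in one packet."""
    return sum(packet)


def run(records, packet_size=3, workers=4):
    # Filter first, so every later stage touches as little data as possible.
    kept = [r for r in records if r >= 0]
    packets = split_into_packets(kept, packet_size)
    # Each packet is independent, so the packets can be handed to a pool of workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_packet, packets))
    # Combine the per-packet results at the very end.
    return sum(partials)


print(run([5, -1, 3, 8, -7, 2, 4, 6]))
```

For CPU-heavy work in Python, a process pool (or separate processes started by GNU parallel, one per packet file) is likely the closer analogue, since threads here mainly show the structure of the pipeline rather than deliver real speedup.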

And the next step is distributed processing ( with and without Julia )

Bonus: watch the RFC: Experimental API for may-happen in parallel parallelism by tkf · Pull Request #39773 · JuliaLang/julia · GitHub


One way to proceed is to copy successful training from others. One of the best is fast.ai, which is produced and given away for free by Jeremy Howard. If you listen to his interview on the Lex Fridman podcast, he makes a couple of observations:

  • most analysts don’t process huge datasets that require networks of computers
  • most analysts are just working on their single workstation with a single GPU
  • deep learning is the most advanced technique available and it’s not too hard to create state of the art models today

The very first exercise has you train a model to recognize cat pictures (historical fun fact: the internet was built to share cat pictures).

The course currently uses Python. I believe he is also developing a Swift version. There is also an effort to create a Julia version. See also the forum post at fast.ai

So, one thing you could do would be to contribute to that.

Besides that, I think a “traditional” introduction to deep learning would start with tree models (CART), boosted trees, random forest, and support vector machines. All of those and more are available in the MLJ toolbox. One approach – a good one – would be to write tutorials that walk people through the MLJ toolbox. Now you might expect that has been done – and it has: Data Science Tutorials in Julia. You could add to that in either breadth or depth. Or, you might find the tutorials too advanced and create simpler baby-steps tutorials for absolute beginners in data science. You need to define your audience, find your niche, and go for it!


I love the thoroughness you gave to this answer! This really solidified in my mind not only what MPI is in general but also why MPI would be so useful to look into for these purposes. I really liked how you rooted the thoughts back to practical applications (such as batch processing of multiple files). Celeste.jl also seems like a really smart next step once I get more comfortable with how MPI works. The interconnects and InfiniBand talk definitely sounds more advanced, but I will cross that bridge when I get closer to needing that degree of control.

Oh cool! I’ll be reading on this further - thanks for giving me the high level overview here on it.

That is an interesting spin on this problem area. I guess I am looking for something less expensive in the long term than paying third parties (like AWS) for massive compute, or at least a way to minimize spending there.

Would you mind explaining a bit more about doing work “in RAM”? What do you mean by that in the context of a data science processing pipeline? Like reading in files, performing computation, etc?

Also, I have never heard of {disk.frame} or vaex! Are they Julia tools or does Julia provide interfaces to them?

Thanks :smiley:

Thanks @ImreSamu - I think this is actually a pretty decent roadmap for learning as is! I will explore it some more.

What is a data cube by the way? Haven’t come across that term before.

Would you mind explaining a bit more why this is your favorite methodology?

I have heard a lot of good things about fast.ai @blackeneth - not terribly interested in AI applications at the moment, but could be an interesting place to look into. Will listen to that podcast - thanks for the recommendation friend.