Hi all
Hope you are doing well! In alignment with both my professional and personal interests, I am slowly starting to turn my attention to the domain of High Performance Computing for the purposes of:
- Designing high throughput data pipelines in Julia
- Working with multi-threaded processes
- Handling the processing of larger than memory datasets
- Efficient applications of algorithms to time series data (online and offline)
- Maximizing use of hardware/software capabilities
If this sounds somewhat vague and hand-wavy, that is because it totally is. I am very much a beginner in this area and am trying to develop a beginner’s roadmap for learning the above material, assuming those are even the areas I should be thinking about in the realm of High Performance Computing for data science.
I have been working through Performance slowly, learning more about binary data storage formats like HDF5 and Apache Arrow (my current favorite), interfacing with relational databases from Julia, and picking up tips and tricks here and there. However, as the title of this thread says, I made this post to be somewhat of a nexus point for collating resources one could use to scope out a beginner’s roadmap for learning about High Performance Computing applied to data science.
For example, LoopVectorization.jl looks amazing, but I am not sure if it would be applicable to data science tooling. Parallel computing with Distributed looks good, but I am not sure how to use it effectively in conjunction with awesome tools like DataFrames.jl and Arrow.jl. And the list continues.
So would anyone be willing to comment on approaches to building a beginner’s roadmap, share what you wish you had known up front about High Performance Computing before diving in, and point to resources/strategies for learning best practices?
Thank you kindly and have a wonderful day!
Yours,
~ tcp
Additional Background
What do I mean by Roadmap?
What I mean by “roadmap” is a step-by-step learning approach to tackling this problem. It could be along the lines of:
- Read the Julia Docs
- Check out packages X, Y, Z
- Go practice things at
Right now, I am trying to map out, to paraphrase a quote, what I know, what I don’t know, and what I don’t know that I don’t know.
What is meant by Data Science?
Honestly, I have left this intentionally vague, as data science means SO much across the tech world. If you have experience working in data science in any way with a focus on High Performance Computing, please share your thoughts here.
What is my background?
I am not a computer scientist! I come from the world of biomedical engineering, healthcare informatics, and academia - so my responses will be colored by my lack of formal computer science knowledge.