Developing a Beginner's Roadmap to Learn Julia High Performance Computing for Data Science

Celeste is an interesting case because they don’t actually have an explicit dependency on MPI.jl, and it’s a bit tricky to figure out exactly what they were doing from looking at the source, but they do mention in interviews:

“we have integrated the DTree scheduler and utilized MPI-3 one-sided communication primitives”

so they must have been hooking into the cluster’s MPI at some level.
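
For reference, here is roughly what MPI-3 one-sided communication looks like from Julia via MPI.jl. This is only an illustrative sketch, not Celeste’s actual code, and the exact MPI.jl signatures (e.g. MPI.Put!) have changed between package versions:

# onesided.jl -- sketch of MPI-3 one-sided communication via MPI.jl.
# Launch with mpiexec (or MPI.jl's mpiexecjl wrapper), e.g.:
#   mpiexecjl -n 2 julia onesided.jl
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Each rank exposes a local buffer as a "window" that other ranks can
# write into directly, without the owner posting a matching receive.
buf = zeros(Float64, 10)
win = MPI.Win_create(buf, comm)

MPI.Win_fence(0, win)                       # open an access epoch
if rank == 0
    MPI.Put!(fill(1.0, 10), win; rank = 1)  # write into rank 1's window
end
MPI.Win_fence(0, win)                       # close the epoch; data now visible on rank 1

MPI.free(win)
MPI.Finalize()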


Data cubes / multidimensional tiling: partitioning/splitting the data by dimensions.

For example, suppose we have only 3 dimensions:

  • Date/Time (e.g. splitting by month)
  • Spatial (Quadtiles; H3; S2)
  • Genus / Species / Taxon

And each cell has ~0…10,000 trees, so we can split the data into small “data tiles” (cubes).

So just generate a big task list into ./bio_tasklist.sh:

./julia_bio_task.sh 2021-01 ADBA Tilia
./julia_bio_task.sh 2021-01 ADBA Populus
./julia_bio_task.sh 2021-01 ADBA Fraxinus
./julia_bio_task.sh 2021-02 ADBA Tilia
./julia_bio_task.sh 2021-02 ADBA Populus
./julia_bio_task.sh 2021-02 ADBA Fraxinus
...
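
The task list itself can be generated programmatically; a minimal Julia sketch (the months, tiles, and taxa here are placeholder values):

# make_tasklist.jl -- hypothetical generator for ./bio_tasklist.sh,
# one line per (month, tile, taxon) cell of the data cube.
months = ["2021-$(lpad(m, 2, '0'))" for m in 1:12]
tiles  = ["ADBA"]                         # quadtile / H3 / S2 cell ids
taxa   = ["Tilia", "Populus", "Fraxinus"]

open("bio_tasklist.sh", "w") do io
    for month in months, tile in tiles, taxon in taxa
        println(io, "./julia_bio_task.sh $month $tile $taxon")
    end
end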

And run it with GNU parallel: "--jobs $(nproc)" sets the number of parallel tasks to the CPU count.

time parallel --delay 2 --jobs $(nproc) --results ./jobs/bio_tasks  -k  < ./bio_tasklist.sh
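
Each task can then dispatch to a small Julia worker; a hypothetical julia_bio_task.jl that the shell wrapper could invoke with julia julia_bio_task.jl "$@" (the file layout here is an assumption):

# julia_bio_task.jl -- hypothetical per-tile worker, invoked once per
# (month, tile, taxon) combination, e.g.:
#   julia julia_bio_task.jl 2021-01 ADBA Tilia
month, tile, taxon = ARGS

# Assumed layout: one input file per cell of the data cube.
infile  = joinpath("tiles", month, tile, "$(taxon).csv")
outfile = joinpath("results", "$(month)_$(tile)_$(taxon).csv")

println("processing $infile -> $outfile")
# ... per-tile processing/aggregation goes here ...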


By the way, there is some biological data in OSM, if you need it 🙂


What’s a processing pipeline? It’s a bunch of transformations and side-effects, like storing data into S3.

Data processing starts by retrieving data from somewhere, e.g. a DBMS, HDFS, etc. It is then processed by something, which is often a Spark cluster. So replace that Spark cluster with a single computer, so everything can happen in RAM. I.e. replace it with a robust single-node tool. Maybe single-node Spark? Or even Dask, disk.frame, vaex, or let the user choose.
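
For illustration, the same read-transform-store shape on a single node in Julia, using the CSV.jl and DataFrames.jl packages (the file names and the :species grouping column are made up):

# In-RAM pipeline sketch: retrieve -> transform -> side-effect.
using CSV, DataFrames

df  = CSV.read("observations.csv", DataFrame)         # retrieval (stand-in for DBMS/HDFS)
out = combine(groupby(df, :species), nrow => :count)  # transformation
CSV.write("species_counts.csv", out)                  # side-effect: persist the result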

There are technologies now which make large-memory machines available, give very fast access to data over networks, and enable composable infrastructure.
Intel Optane persistent memory is a tier below RAM.

Dell can provide multi-terabyte single servers using persistent memory. I would have to check what the limits are.

You can have composable infrastructures - where you build a high-performance server on demand with a set of CPUs, GPUs, and high memory - which can be torn down and redistributed for the next training or analysis run.
Dell work very closely with Liqid on this - have a look at this 12-terabyte server:

https://www.liqid.com/dell-technologies/solution-bundles/application/liqid-composable-high-memory-appliance

Also look at NVIDIA BlueField.


Comments about time to access data from network storage versus local storage on a server are getting a bit out of date.

Look at the Data Accelerator at the Dell Centre of Excellence in Cambridge

https://www.dell.com/support/kbdoc/en-us/000122853/dell-emc-data-accelerator-reference-architecture

Weka say their storage is FASTER than local disk.

Also look at Intel DAOS
https://www.intel.co.uk/content/www/uk/en/high-performance-computing/daos-high-performance-storage-brief.html


Oh so that’s what they’re using Optane for! Yeah, makes sense.

That’s amazing! Although it’s 12 TB of Optane, so it’s a bit slower than DDR RAM but faster than SSDs. Optane is definitely to SSDs what SSDs were to spinning hard drives.

You make it sound like this specialist architecture is ubiquitous, which it is not. Most Spark clusters suffer from poor network performance versus fetching from local disk. It will be many years before local storage can be considered “out-dated”.

@xiaodai Servers have a mix of DRAM and Optane memory - not all Optane. So depending on how much data you access at a time, performance may not be greatly affected. As usual, YMMV.

This is a subject that I am not expert in; however, if there is interest I can track down someone from Intel to comment.

Of course, all computers need DRAM to function, given current architectures. You mentioned only Optane, so I pointed at the Optane part.