Only partly joking: you really can't go wrong with learning MPI.
While for very large problems you may want some sort of hierarchical parallelism, with a shared memory model within nodes and a distributed memory model between nodes, it's really the shared part that is optional. Latency between nodes is the killer, and shared memory generally doesn't scale beyond a single node anyway. And if you want fine-grained control over exactly what information critically needs to be shared where, MPI is really the de facto way to do that.
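For what it's worth, here is a minimal sketch of what that granularity looks like with MPI.jl (the array and the reduction are just placeholders): each rank owns its own data, and only what you explicitly pass to a communication call ever crosses rank boundaries.

```julia
# Minimal distributed-memory sketch with MPI.jl: each rank works on its own
# chunk, and only the single number you explicitly reduce is communicated.
using MPI

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
nproc = MPI.Comm_size(comm)

# Each rank owns its slice of the problem; nothing is shared implicitly.
local_chunk = rand(1_000_000)
local_sum   = sum(local_chunk)

# You decide exactly what gets communicated: here, one Float64 per rank.
global_sum = MPI.Allreduce(local_sum, +, comm)

rank == 0 && println("global sum over $nproc ranks = ", global_sum)

MPI.Finalize()
```

Run it under `mpiexec -n 4 julia script.jl` (or the `mpiexecjl` wrapper that MPI.jl can install for you).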
Some other, more general advice, IMHO:
- Write like you're writing in C. Do things manually as much as possible.
- Amdahl's law: know it, love it (the formula is spelled out after this list).
- Do use vector registers if you're on CPU. LoopVectorization is awesome for that (see the sketch after this list). If your problem isn't amenable to that, make it amenable.
- While I've tried to avoid having to use GPUs, that's probably a losing proposition long term, especially given that "Xeon Phi" and the like never really took off, and systems like Cori may be the last serious supercomputers to use them. Most (I suspect all) of the exascale systems in development are getting a substantial majority of their flops from GPUs. Which is another reason Julia is great: I'd so much rather use CUDA.jl than raw CUDA (see the last sketch below). That said, I still think there will be a niche for CPU-only compute for a long time yet; quantum chemistry, anyone?
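Re: Amdahl's law, the formula worth internalizing is just this (a toy calculation, not tied to any particular code):

```julia
# Amdahl's law: if a fraction p of the runtime parallelizes perfectly,
# the best possible speedup on N workers is 1 / ((1 - p) + p / N).
amdahl_speedup(p, N) = 1 / ((1 - p) + p / N)

# Even 95% parallel code tops out below 20x, no matter how many workers:
amdahl_speedup(0.95, 1024)   # ≈ 19.6
amdahl_speedup(0.95, Inf)    # = 20.0
```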
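On the vector-registers point, this is the sort of plain, contiguous loop that LoopVectorization's `@turbo` handles well (a toy saxpy, just to show the shape of it):

```julia
# A simple contiguous loop is exactly what LoopVectorization can turn
# into SIMD code on the CPU via @turbo.
using LoopVectorization

function saxpy!(y, a, x)
    @turbo for i in eachindex(x)
        y[i] = a * x[i] + y[i]
    end
    return y
end

x = rand(Float32, 10^6)
y = rand(Float32, 10^6)
saxpy!(y, 2f0, x)
```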
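And to illustrate why CUDA.jl beats raw CUDA for me (a rough sketch; the kernel is just a toy axpy): most of the time broadcasting on a `CuArray` is all you need, and even when you do write a kernel it's still plain Julia.

```julia
using CUDA

x = CUDA.rand(Float32, 10^7)
y = CUDA.rand(Float32, 10^7)

# A fused broadcast compiles to a single GPU kernel; no hand-written CUDA C.
z = 2f0 .* x .+ y

# And when you do want an explicit kernel, it's still just Julia:
function axpy_kernel!(z, a, x, y)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(z)
        @inbounds z[i] = a * x[i] + y[i]
    end
    return nothing
end

@cuda threads=256 blocks=cld(length(z), 256) axpy_kernel!(z, 2f0, x, y)
```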