Dagger's DArray MPI subpackage roadmap

Hello guys! Hope everyone is doing awesome and having a great year developing Julia code. I am writing this post to serve three purposes:

  • Describe what is currently available in the DArray MPI API;
  • Wrap up my Google Summer of Code project; and
  • Present the next milestones for the API and our expectations.

First of all, I would like to thank my incredible mentors, @jpsamaroo and @evelyne-ringoot, with whom I hope to keep working from now on. Without them I would never have been able to develop Julia code at the level we did; thank you for this opportunity and for your attention.

Development so far

Base DArray

During the project we managed to completely transfer the legacy lazy DArray interface, which only performed operations when the results were needed, to the new eager API based on the @spawn macro and the spawn function. This resulted in PR#396, where we updated all the delayed and thunk calls to the aforementioned eager API and changed the relevant functions to take advantage of the new implementation. With this, the DArray is ready to be used with all the base operations you would expect from an AbstractArray type.
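To give a flavor of what that looks like, here is a minimal sketch of the eager interface (illustrative only; it assumes Blocks and distribute are available as in Dagger's DArray documentation):

```julia
using Dagger

A  = rand(64, 64)
DA = distribute(A, Blocks(16, 16))   # partition into 16x16 blocks, scheduled eagerly

DB = DA .+ 1                         # element-wise operations return another DArray
t  = Dagger.@spawn sum(DB)           # eager task via the @spawn macro
fetch(t)                             # wait for and retrieve the result

collect(DB)                          # gather the distributed result back into an Array
```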

Apart from that, we needed a new distribution paradigm to support MPI-style programming, i.e., a fixed size for all partitions. We therefore created the general AbstractBlocks type along with AbstractMultiBlocks and AbstractSingleBlocks, which represent data partitions that can have multiple sizes or a single fixed size, respectively. These changes were made in PR#408, together with changes guaranteeing that DArray operations always return a DArray.
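Conceptually, the hierarchy looks roughly like the sketch below (my reading of PR#408, not a copy of the source):

```julia
abstract type AbstractBlocks{N} end
abstract type AbstractMultiBlocks{N}  <: AbstractBlocks{N} end  # partitions may differ in size
abstract type AbstractSingleBlocks{N} <: AbstractBlocks{N} end  # one fixed size for every partition
```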

MPI DArray

Given the new DArray shared-memory interface, we started developing a distributed-memory platform for manipulating arrays and matrices. Our first goal was to let users distribute their data without having to specify the size of each partition, simply by calling the distribute, rand, randn, ones and zeros functions with the dimensions of the MPIBlocks parameter set to nothing. This was a challenging part of the implementation: since all partitions must be the same size and the user can specify any number of ranks, the solution we settled on is to throw a warning suggesting that the user either specify their own dimensions on the MPIBlocks or change the number of MPI ranks to one larger than the one currently in use.
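A hedged sketch of how that entry point is meant to be used is shown below; the exact signatures (for example, whether the MPI communicator is passed explicitly) are my assumptions, not a stable API. You would launch it with something like mpiexec -n 4 julia script.jl:

```julia
using Dagger, MPI

MPI.Init()

# Leaving the MPIBlocks dimensions as `nothing` asks Dagger to pick equal
# partition sizes for the current number of ranks (or warn if it cannot).
DA = distribute(rand(32, 32), MPIBlocks(nothing, nothing))

# The same convention applies to the constructors mentioned above.
DB = rand(MPIBlocks(nothing, nothing), Float64, 32, 32)
```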

We already have distributed array reductions: sum, prod and reduce work either along specified dimensions or over the whole array. All of the code mentioned is available in PR#422 and PR#407, together with tests for the MPI changes to the DArray interface.
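For illustration, reusing the hypothetical DA handle from the sketch above, the reductions follow Base's API, with dims selecting the reduced dimension:

```julia
total  = sum(DA)            # reduce the whole distributed array
cols   = sum(DA; dims=1)    # reduce along a specified dimension
p      = prod(DA)
r      = reduce(+, DA)      # generic reduction with a user-supplied operator
```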

Next milestones

Probably next week we will open a PR to merge distributed matrix multiplication over MPI, which is already done and working, along with a new automatic distribution scheme that favors linear algebra algorithms: by setting the linalg flag to true in the new implementation, the data is distributed toward square partitions in order to reduce communication costs between ranks.
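As a sketch of the planned interface (the linalg flag name comes from the description above, but its exact placement in the call is my assumption):

```julia
DA = distribute(rand(1024, 1024), MPIBlocks(nothing, nothing); linalg=true)
DB = distribute(rand(1024, 1024), MPIBlocks(nothing, nothing); linalg=true)

DC = DA * DB   # distributed matrix multiplication across MPI ranks
```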

We plan to have linear algebra factorizations over MPI as soon as possible. For that, we will soon test the blocked and tiled implementations of the QR factorization given by the articles available on arXiv and Hindawi, to determine which implementation paradigm benefits more from distributed memory, keeping the focus on large-scale architectures. Once that is settled, the focus will turn to the implementation of the SVD, LU and Cholesky factorizations. Good documentation is also a big concern, in order to facilitate use and further development, so once the base operations are merge-ready we plan to write the documentation for the aforementioned operations.

How to contribute

First, I strongly suggest looking into the base DArray files and understanding their base operations and types. From there, if you want to contribute to the shared-memory implementations, dive into the remaining DArray files. Similarly, if you would rather work on distributed memory using processes and MPI, take a look at the MPI implementations, starting with how domains and subdomains are handled in the distribute and collect functions. Any issues can be reported directly to the Dagger repository on GitHub, as can PRs with any changes you deem necessary.


Any initiatives for linear algebra implementations? Are they planned to be bindings to ScaLAPACK or implementations from scratch?


Yes! I am currently working on the QR factorization; I had a few setbacks, but things are working better now. As for the implementation, I was planning on using bindings and wrappers; however, the base factorization routines in Julia are already wrappers, so I plan to use them together with the MPI.jl package.
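As a toy illustration of that combination (not the planned algorithm itself): each rank factorizes its local panel with Julia's LAPACK-backed qr, and MPI.jl moves the resulting factors between ranks.

```julia
using LinearAlgebra, MPI

MPI.Init()
comm = MPI.COMM_WORLD

Alocal = rand(128, 16)           # this rank's local panel of the distributed matrix
F = qr(Alocal)                   # LinearAlgebra.qr dispatches to LAPACK under the hood

# Gather every rank's (flattened) R factor, e.g. as input to a later reduction step.
Rs = MPI.Allgather(Matrix(F.R), comm)
```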


Is there still an opportunity to contribute? If so, what are some of the things I can start with?