Hello guys! Hope everyone is doing awesome and having a great year developing Julia code. I am writing this post to serve three purposes:
- Describe what is currently available in the `DArray` MPI API;
- Conclude my Google Summer of Code project; and
- Present the next milestones for the API and our expectations.
First of all I would like to thank my incredible mentors, @jpsamaroo and @evelyne-ringoot, with whom I hope to keep working from now on. Without them I would never have been able to develop Julia code at the level we did; thank you for this opportunity and for your attention.
Development so far
Base DArray
During the project we managed to completely transfer the legacy lazy `DArray` interface, which only performed operations when the results were needed, to the new eager API built on the `@spawn` macro and `spawn` function. This resulted in PR#396, in which we updated all the `delayed` and `thunk` calls to the aforementioned eager API calls and changed the necessary functions to leverage the advantages of the new implementation. With this, the `DArray` is ready to be used with all the base operations you would expect from an `AbstractArray` type.
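To make the shared-memory side concrete, here is a minimal sketch of how the eager `DArray` can be driven from user code; the block sizes and operations are purely illustrative, not a prescribed workflow.

```julia
using Dagger

# Distribute a matrix into 2x2 blocks; the block size here is only illustrative.
A  = rand(4, 4)
DA = distribute(A, Blocks(2, 2))

# Base AbstractArray-style operations work directly on the DArray and are
# scheduled eagerly under the hood.
s = sum(DA)             # reduction over the whole distributed array
B = collect(DA .+ DA)   # elementwise broadcast, then gather back to an Array

# The eager API can also be used directly through the @spawn macro.
t = Dagger.@spawn sum(A)
fetch(t)
```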
Apart from that, we needed a new distribution paradigm to support MPI-style programming, i.e., using a fixed size for all partitions. We therefore created the general `AbstractBlocks` type along with `AbstractMultiBlocks` and `AbstractSingleBlocks`, which represent that each data partition can have multiple sizes or a fixed, established size, respectively. These changes were made in PR#408, together with changes to guarantee that `DArray` operations always return a `DArray`.
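For intuition, here is a rough sketch of what that partitioning hierarchy looks like; the names follow PR#408, but the exact definitions and type parameters in Dagger may differ.

```julia
# Sketch of the partitioning type hierarchy described above (names from PR#408;
# the actual definitions in Dagger may differ).
abstract type AbstractBlocks{N} end

# Partitions are allowed to have different sizes (e.g. smaller edge blocks).
abstract type AbstractMultiBlocks{N} <: AbstractBlocks{N} end

# Every partition has the same fixed, established size, as required for
# MPI-style programming.
abstract type AbstractSingleBlocks{N} <: AbstractBlocks{N} end
```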
MPI DArray
Given the new `DArray` shared-memory interface, we started developing a distributed-memory platform to manipulate arrays and matrices. Our first goal was to let users distribute their data without having to specify the size of each partition, simply by using the `distribute`, `rand`, `randn`, `ones` and `zeros` functions with the dimensions of the `MPIBlocks` parameter set to `nothing`. This was a challenging aspect of the implementation: since all partitions must be the same size and the user can specify any number of ranks, the solution is to throw a warning suggesting that the user either specify their own dimensions in the `MPIBlocks` or change the number of MPI ranks to one larger than the number currently in use.
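As a hypothetical sketch of the usage described above (the `MPIBlocks` constructor and these method signatures come from the in-progress PRs, so the merged API may well differ):

```julia
# Hypothetical sketch based on the in-progress MPI DArray PRs; MPIBlocks and
# the exact method signatures may differ in the merged version.
using MPI, Dagger

MPI.Init()

# Passing `nothing` as the MPIBlocks dimensions asks the implementation to
# pick equal-sized partitions across the available ranks.
DA = rand(MPIBlocks(nothing, nothing), Float64, 100, 100)
DB = distribute(rand(100, 100), MPIBlocks(nothing, nothing))
DZ = zeros(MPIBlocks(nothing, nothing), Float64, 100, 100)
```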
We already have distributed array reductions using `sum`, `prod` and `reduce`, either along specified dimensions or over the whole array. All the code mentioned is available in PR#422 and PR#407, along with tests for the MPI-related changes to the `DArray` interface.
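Continuing the hypothetical sketch above, the reductions would look roughly like this (again assuming the in-progress API from PR#422 and PR#407):

```julia
# Sketch of distributed reductions on an MPI-backed DArray; assumes the same
# hypothetical MPIBlocks API as in the previous sketch.
using MPI, Dagger

MPI.Init()
DA = rand(MPIBlocks(nothing, nothing), Float64, 100, 100)

total   = sum(DA)           # reduce the whole distributed array
colsums = sum(DA; dims=1)   # reduce along a specified dimension
p       = prod(DA)
r       = reduce(+, DA)     # generic reduction with a user-supplied operator
```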
Next milestones
Probably next week we will open a PR to merge distributed matrix multiplication using MPI, which is already done and working, along with a new automatic distribution scheme that favors linear algebra algorithms: by setting the `linalg` flag to true in the new implementation, the data will be distributed to favor square partitions in order to reduce communication costs between ranks.
We plan to have linear algebra factorizations using MPI as soon as possible. For that, we will test the blocked and tiled implementations of the QR factorization given by the articles available on arXiv and Hindawi, to determine which implementation paradigm benefits more from distributed memory, keeping the focus on large-scale architectures. Once that is settled, the focus will turn to the implementation of the SVD, LU and Cholesky factorizations. Having good documentation is also a big concern, in order to facilitate use and further development, so once the base operations are merge-ready we plan to write the corresponding documentation for the aforementioned operations.
How to contribute
Firstly, I strongly suggest looking into the base `DArray` files and understanding its base operations and types. From there, if you want to contribute to the shared-memory implementation, dive into the remaining `DArray` files. Similarly, if your interest lies in distributed memory using processes and MPI, take a look at the MPI implementations, starting with understanding how domains and subdomains are handled in the `distribute` and `collect` functions. Any issues can be reported directly to the Dagger repository on GitHub, as can PRs with any changes you deem necessary.