Best practice for data-parallel computation using DistributedArrays

There is a popular programming paradigm in distributed ML training called data parallelism, which I will briefly describe below. What is the best way to program it using DistributedArray (or anything else that might be a better fit)? The reason that I am asking is because I am working on a project that data parallelism is one use case that it’s supposed to make efficient and simple. I would like to better understand how my approach compares to best practices using existing tools in Julia.

Data parallelism: say I have two large arrays A and B. Both arrays are so large that neither can fit in a single machine’s memory. I would like to process array B’s elements one by one. For processing each element of B, I need to read and write to some elements of A and the indices to A depend on the value of the B element. For each element of B, the number of elements of A that I need to access is small enough that they can fit in one machine’s memory.

It seems most natural to create array A and B as two DistributedArrays. My understanding is that DistributedArrays allow my workers to read and write to elements stored in other processes. Am I correct? If I were to access 10 elements of A for processing one element of B, say A[1], A[20], A[84], A[143], …, is this going to cause 10 reads over the network to fetch the relevant values? If those 10 reads happen one after another, then the latency can be very long.

The question might be too long and ambiguous. :slight_smile: I think it would also be very helpful for me if someone could point me to some more sophisticated examples using DistributedArrays or other good distributed libraries so I could learn by myself.