There is a popular programming paradigm in distributed ML training called data parallelism, which I will briefly describe below. What is the best way to program it using DistributedArrays.jl (or anything else that might be a better fit)? The reason I am asking is that I am working on a project for which data parallelism is one of the use cases it is supposed to make efficient and simple. I would like to better understand how my approach compares to best practices using existing tools in Julia.
Data parallelism: say I have two large arrays A and B. Both arrays are so large that neither fits in a single machine's memory. I would like to process array B's elements one by one. To process each element of B, I need to read and write some elements of A, and the indices into A depend on the value of that B element. For each element of B, the number of elements of A I need to access is small enough that they easily fit in one machine's memory.
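To make the pattern concrete, here is a plain serial, single-machine sketch of what I mean. `indices_for` and `update!` are just placeholders I made up for the problem-specific pieces: the first maps an element of B to the (small) set of indices of A it touches, the second reads and writes those elements.

```julia
A = rand(1_000)      # stands in for the huge array A
B = rand(10_000)     # stands in for the huge array B

# Toy index rule: which elements of A does this B element need?
indices_for(b) = unique(mod1.(round.(Int, b .* (7, 131, 977)), length(A)))

# Toy read-modify-write of the touched elements of A.
function update!(A, idxs, b)
    for i in idxs
        A[i] += b
    end
end

for b in B
    idxs = indices_for(b)   # small set of indices into A
    update!(A, idxs, b)     # touch only those elements
end
```

The question is how to express this same loop when A and B are distributed across machines.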
It seems most natural to create arrays A and B as two DArrays. My understanding is that DistributedArrays.jl allows my workers to read and write elements stored in other processes. Is that correct? If I need to access 10 elements of A while processing one element of B, say A[1], A[20], A[84], A[143], …, will this cause 10 separate reads over the network to fetch the relevant values? If those 10 reads happen one after another, the accumulated latency could be very long.
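Is manually batching the reads the recommended way around this? Below is a rough sketch of what I have in mind, assuming a 1-D DArray: group the requested indices by owning worker and issue one remote call per worker rather than one per element. `fetch_scattered` is a name I made up for illustration; it is not part of DistributedArrays.jl, and I am not sure whether the package already does something equivalent internally when you index with a vector of indices.

```julia
using Distributed, DistributedArrays   # assumes DistributedArrays.jl is installed

# Sketch: fetch scattered elements of a 1-D DArray with one remote call per
# owning worker instead of one round trip per element.
function fetch_scattered(d::DArray, idxs::Vector{Int})
    results = Dict{Int,eltype(d)}()
    for p in procs(d)
        # One round trip per worker: the worker returns whichever of the
        # requested indices it owns, together with their values.
        pairs = remotecall_fetch(p, idxs) do js
            rng = first(localindices(d))   # global index range owned here (1-D)
            lp  = localpart(d)             # the chunk stored on this worker
            [(j, lp[j - first(rng) + 1]) for j in js if j in rng]
        end
        for (j, v) in pairs
            results[j] = v
        end
    end
    return results
end

# Usage sketch (assumes workers were added with `addprocs` beforehand):
# A_d  = drand(10_000)                           # distributed stand-in for A
# vals = fetch_scattered(A_d, [1, 20, 84, 143])  # Dict of index => value
```

The per-worker calls could presumably also be issued concurrently with `@sync`/`@async` to overlap their latencies, but I would like to know whether hand-rolling something like this is necessary at all, or whether there is a more idiomatic way to express this access pattern.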