I am trying to implement multi-node tensor contraction, and I am running into some issues. My current approach is, given two input tensors A and B and an output tensor C (all
DArrays), to spawn on each worker a loop over all workers that fetches each worker's localpart of B, contracts it with the worker's own localpart of A, and then uses
@spawnat to add the output chunk to the appropriate part of C. This approach gives the correct result, but benchmarking on a two-node setup shows that at any given moment each process is either computing (a contraction or an addition) or sending/receiving data, never both at the same time.
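A minimal sketch of this pattern, simplified to a matrix product C = A * B and assuming DistributedArrays.jl is installed. The helper names `add_chunk!` and `contract_local!`, the array sizes, the row/column layout, and the use of two local workers in place of two nodes are all assumptions made for illustration:

```julia
using Distributed
addprocs(2)                      # assumption: two local workers stand in for two nodes
@everywhere using DistributedArrays

# Hypothetical helper: add `chunk` into rows `rows` of this worker's localpart of C.
@everywhere function add_chunk!(C::DArray, rows, chunk)
    localpart(C)[rows, :] .+= chunk
    return nothing
end

# Hypothetical helper, run on each worker: loop over all workers holding a part of B,
# fetch that part, contract it with the local part of A, and push the result to C.
@everywhere function contract_local!(C::DArray, A::DArray, B::DArray)
    a = localpart(A)                  # local row block of A
    rows = localindices(A)[1]         # global row indices of that block
    @sync for q in procs(B)
        b = remotecall_fetch(localpart, q, B)   # communication: fetch q's column block of B
        chunk = a * b                           # computation: the local contraction
        @spawnat q add_chunk!(C, rows, chunk)   # communication: send chunk to the owner
    end
    return nothing
end

# Assumed layout: A split by rows, B and C split by columns.
A = distribute(rand(4, 6); procs = workers(), dist = [nworkers(), 1])
B = distribute(rand(6, 4); procs = workers(), dist = [1, nworkers()])
C = dzeros((4, 4), workers(), [1, nworkers()])

@sync for p in workers()
    @spawnat p contract_local!(C, A, B)
end
```

In this sketch the fetch, the contraction, and the remote addition run one after another inside a single loop on each worker, which matches the alternating compute/communication behavior seen in the benchmark.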
What is the best way to achieve simultaneous I/O operations and compute operations on a multi-node setup?
Specifically, I would like each worker to:
a) run a computation involving shards of a DArray
b) return its localpart if queried for it
c) add to its localpart of C (the result array) if asked to do so
where b) and c) happen concurrently with a).