I am attempting to implement multinode tensor contraction, and I am running into some issues. My current approach is as follows: given two input tensors A and B and an output tensor C (all `DArray`s), I spawn on each worker a loop over all workers that queries each one for its localpart of B, contracts it with the localpart of A, and then uses `@spawnat` to add the output chunk to the appropriate part of C. This approach gives the correct result, but benchmarking on a two-node setup shows that at any given time each process is either computing (a contraction or an addition) or sending/receiving data, never both at once.
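For concreteness, here is a minimal sketch of that pattern, reduced to a matrix product so it stays short. It assumes DistributedArrays.jl, two local workers standing in for the two nodes, and all three arrays split by rows; the helper names `fetch_block`, `add_chunk!`, and `contract_local!` are hypothetical and not taken from my actual code.

```julia
using Distributed
addprocs(2)                      # assumption: two local workers stand in for two nodes
@everywhere using DistributedArrays

# Hypothetical helper: runs on the owner of a block of B and returns that
# block together with the global row range it covers.
@everywhere fetch_block(B) = (localpart(B), localindices(B)[1])

# Hypothetical helper: runs on the owner of a block of C and accumulates
# a partial result into it in place.
@everywhere function add_chunk!(C, chunk)
    Cloc = localpart(C)
    Cloc .+= chunk
    return nothing
end

# Hypothetical helper: the per-worker loop described above.
@everywhere function contract_local!(C, A, B)
    Aloc = localpart(A)
    for q in workers()
        # Blocking fetch of worker q's block of B.
        Bq, rows = remotecall_fetch(fetch_block, q, B)
        # "Contraction": here just a matrix product over the shared index.
        chunk = Aloc[:, rows] * Bq
        # In this layout C is split like A, so the owner of the output
        # block is this worker; in general it would be another process.
        wait(@spawnat myid() add_chunk!(C, chunk))
    end
    return nothing
end

m, k, n = 4, 6, 3
A = distribute(rand(m, k); dist = [nworkers(), 1])
B = distribute(rand(k, n); dist = [nworkers(), 1])
C = dzeros((m, n), workers(), [nworkers(), 1])

# Start the loop on every worker and wait for all of them to finish.
@sync for p in workers()
    @async remotecall_wait(contract_local!, p, C, A, B)
end
```

In this sketch the fetch, the contraction, and the addition run strictly one after another inside each worker's loop, which matches the serialized behavior I see in the benchmarks.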

What is the best way to overlap I/O operations with compute operations on a multinode setup?

Specifically, I would like each worker to:

a) run a computation involving shards of a DArray

b) return its localpart whenever it is queried for it

c) add to its localpart of C (the result array) whenever it is asked to

where b) and c) happen concurrently with a).