Given a complex, large-scale computation that spans a nontrivial amount of data (tens of GBs), I wish to split the work between multiple processes across several hosts in a cluster. Each host has dozens of processors and plenty of RAM.
To minimize data movement, I have three different types of invocations of non-serial work (sketched in code after this list):

1. Standard run-this-on-any-available-worker. The amount of data sent is small compared to the amount of compute, so it is OK if this runs on a different host in the cluster. I actually have very few of these and can probably live without them.
2. Run on a different worker on the same host. The needed data is large compared to the compute, but it is available in a SharedArray, and I have dozens of threads to go through it.
3. Run the function on a different host. The data sent is small, but the function will allocate some large SharedArrays and spawn local work (type-2 functions).
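To make the three types concrete, here is a rough sketch using the standard Distributed API; all the names (`cheap_summary`, `scan_shared`, `host_phase`, the pid variables, the sizes) are hypothetical placeholders, not real code from my project:

```julia
using Distributed, SharedArrays

# Type 1: run anywhere -- small data in, lots of compute.
result = remotecall_fetch(cheap_summary, any_worker_pid, small_input)

# Type 2: run on another worker on the *same* host -- the large data lives
# in a SharedArray, so only a small reference needs to cross the wire.
shared = SharedArray{Float64}(big_length; pids=local_pids)
partial = remotecall_fetch(scan_shared, same_host_pid, shared, my_range)

# Type 3: run on a *different* host -- small arguments in; the callee is
# expected to allocate its own SharedArrays and spawn type-2 work locally.
answer = remotecall_fetch(host_phase, other_host_pid, small_params)
```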
A naive solution would be a two-level tree of workers: a main master process, plus a supervisor process on each host, which controls that host's slave workers.
Questions:
A. Is there any Julia package which provides such functionality? I am aware of Dagger, but it doesn't seem to match the above.
B. Suppose, then, that I wanted to create such a package myself. Naively, I would need to use some ClusterManager (in my cluster, SGE) to spawn the per-host supervisors, and then each of these would call a simple addprocs to create the local worker processes. Would this work?
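That is, something like this minimal sketch, assuming ClusterManagers.jl's `addprocs_sge` and, crucially, assuming a worker process may call `addprocs` itself, which is exactly what I am unsure about (host and worker counts are hypothetical):

```julia
using Distributed
using ClusterManagers  # provides addprocs_sge

n_hosts = 4            # hypothetical: number of hosts to grab
n_local_workers = 32   # hypothetical: workers per host

# Master: spawn one supervisor process per host through SGE.
supervisors = addprocs_sge(n_hosts)

# Each supervisor spawns plain local workers on its own host.
local_pids = Dict{Int,Vector{Int}}()
for sup in supervisors
    local_pids[sup] = remotecall_fetch(() -> addprocs(n_local_workers), sup)
end
```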
C. I assume that creating a SharedArray on the supervisor process or on any of the slave workers allows it to be efficiently accessed by any other process running on the same host: sending the SharedArray through a channel, or passing it as a parameter in a parallel function call, would "just work" via shared memory. The documentation seems to imply this but doesn't state it explicitly. Is this the case?
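In other words, I expect something along these lines to hold (pids hypothetical; both workers assumed to be on this host):

```julia
using Distributed, SharedArrays

same_host_pids = [2, 3]  # hypothetical: two workers on this host
S = SharedArray{Float64}(1_000; pids=vcat(myid(), same_host_pids))

# Write through one same-host worker...
remotecall_wait(A -> (A[1] = 42.0; nothing), 2, S)

# ...and I assume the write is immediately visible to the other one,
# because only a handle to the shared segment was sent, not a copy.
@assert remotecall_fetch(A -> A[1], 3, S) == 42.0
```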
D. Similar question for atomic types such as locks. Would sending them between processes on the same host share the same object, allowing these processes to coordinate their work?
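That is, is this pattern sound? (`same_host_pid` and `critical_work` are hypothetical; whether the lock arrives as the same object or as a copy is precisely the question.)

```julia
using Distributed

lk = ReentrantLock()

# Does the receiving same-host process get *this* lock, so that lock/unlock
# there and here actually exclude each other -- or just an independent copy?
remotecall_wait(l -> lock(critical_work, l), same_host_pid, lk)
```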
Note that in Python the answer to both C and D is "no"; AFAIK, one must create the shared data or atomic object before spawning the worker processes.
E. If I create a SharedArray on one host and then send it to another host, what are the semantics? Is this forbidden, or does the copy become a SharedArray available to the processes on the second host? Would modifications be reflected back to the original host, or only be visible to the processes of the other host?
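Concretely, the scenario I am asking about (hypothetical pids: `pids_on_host_A` live on host A where the array is created, `pid_on_host_B` on host B):

```julia
using Distributed, SharedArrays

# Created on host A, mapped over host A's processes.
S = SharedArray{Float64}(1_000; pids=pids_on_host_A)

# Sent to a process on host B -- forbidden? A detached per-host copy?
# Or are writes somehow reflected back to host A?
remotecall_wait(A -> (A[1] = 1.0; nothing), pid_on_host_B, S)
```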
F. Similar question for atomic types (locks, etc.). What happens if they are sent to another host? Is this forbidden, do they become an independent copy, or do they somehow implement distributed atomic functionality (shudder)?
Thanks,
Oren Ben-Kiki