Ray seems to be an interesting project. In some ways, it is similar to Dagger, although its approach to scheduling and assigning work is noticeably different:
- Ray schedules work from node to worker, Dagger schedules from master to worker (no node abstraction in between for now).
- Ray’s object store can use shared memory between workers on the same node, Dagger’s object store (MemPool) probably does not, but I haven’t dug into those details too much so far.
- Ray relies on many external (non-Python) dependencies, like Redis and Bazel, while Dagger is pure-Julia.
- Ray exposes direct access to its datastores via put/get, Dagger does not (although it could).
- Ray has explicit GPU support, Dagger does not, but see below for future directions.
- Dagger has fewer domain-specific integrations (such as for ML or GPUs).
- Ray is very well documented (from a light perusal of the docs), Dagger’s docs… suck.
- Ray (probably) has many more active contributors than Dagger, and has a larger ecosystem to draw contributors from.
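To make the put/get point above concrete, here's a toy sketch (Python, stdlib only) of the object-store pattern Ray exposes directly and Dagger's MemPool could expose. This is not the real API of either project; `ObjectStore`, `put`, and `get` are hypothetical stand-ins.

```python
import uuid

class ObjectStore:
    """Toy in-process object store, loosely modeled on the put/get
    pattern Ray exposes (and that MemPool could). Not either
    project's actual API -- references here are just hex strings."""

    def __init__(self):
        self._store = {}

    def put(self, value):
        # Return an opaque reference, analogous to Ray's ObjectRef
        # or MemPool's DRef.
        ref = uuid.uuid4().hex
        self._store[ref] = value
        return ref

    def get(self, ref):
        # In a real distributed store, this might fetch from another
        # node or map shared memory; here it's just a dict lookup.
        return self._store[ref]

store = ObjectStore()
ref = store.put([1, 2, 3])
print(store.get(ref))  # -> [1, 2, 3]
```

The interesting design questions (shared memory between workers on a node, eviction, cross-node transfer) all live behind that same two-function surface.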
Otherwise, the two projects are very similar in usage and semantics (eerily so, in my opinion). One is not necessarily better than the other, and both have plenty of potential. Since I don’t know much more about Ray, here’s what I see as the future for Dagger:
- More direct access by users to MemPool’s datastore
- Knowledge about node layout and features made available to the scheduler
- Specific details about latencies (network, thunk/job, scheduling, etc.) “learned” by the scheduler to improve efficiency and occupancy
- Improved scheduling efficiency overall (there are some low-hanging fruit still)
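As a rough illustration of the "learned latencies" bullet, here's a minimal sketch (Python, stdlib only, purely hypothetical; this is not Dagger's scheduler code) of a scheduler keeping per-category latency estimates via an exponentially weighted moving average:

```python
class LatencyEstimator:
    """Toy sketch: a scheduler 'learns' latencies per category
    (network, thunk/job execution, scheduling overhead) from past
    observations, using an exponentially weighted moving average."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha     # weight given to the newest observation
        self.estimates = {}    # category -> estimated latency (seconds)

    def observe(self, category, latency):
        prev = self.estimates.get(category)
        if prev is None:
            self.estimates[category] = latency
        else:
            self.estimates[category] = (1 - self.alpha) * prev + self.alpha * latency

    def estimate(self, category, default=0.0):
        return self.estimates.get(category, default)

est = LatencyEstimator()
for obs in (0.10, 0.12, 0.11):
    est.observe("network", obs)
```

A scheduler with estimates like these could, for example, decide whether moving data to a remote worker is worth it versus running a thunk locally.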
And GPU support gets its own section:
GPU support in a distributed computing system is non-trivial because of how differently GPUs behave compared to CPUs. They add an extra hop for data transfers, do not currently have great support for global memory access in most operating systems (compared to CPUs), handle certain workloads much more (or much less) efficiently than CPUs, have a variety of “weird”, niche instructions and storage formats, etc. Additionally, the layout of GPUs in a compute cluster is not as obvious or as easy to detect as that of CPUs, since they are attached devices (not usually integrated into the motherboard or CPU chipset, although sometimes they are!).
That said, GPU support in Dagger is something I’ve been thinking about here and there. The linked PR is a start at solving this, but I don’t believe it actually touched the scheduler at all, so its efficiency will be hard to predict and manage. For GPU support in Dagger to be efficient, Dagger needs to be able to figure out where GPUs exist (on which nodes), what their capabilities are, and how well the user’s tasks run on a GPU vs. a CPU. Communicating this information “by hand” is probably infeasible for anything other than basic matrix multiplications, so I believe that allowing Dagger to benchmark equivalent CPU and GPU kernels will be key (and is something that both CuArrays/ROCArrays and GPUifyLoops can make possible). However, I haven’t gotten to the point of figuring out what that interface will look like, so this is all just hand-waving; I don’t yet have the slightest idea of how to implement it either.
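The benchmark-and-choose idea can at least be sketched, even if the real interface is undecided. Here's a toy version (Python, stdlib only; `pick_device` and the stand-in kernels are hypothetical) that times equivalent kernels and picks the fastest device. Real GPU timing would need warmup runs and device synchronization (e.g. CUDA events), which this deliberately ignores:

```python
import time

def _time_once(kernel, args):
    # Wall-clock one invocation of a kernel.
    start = time.perf_counter()
    kernel(*args)
    return time.perf_counter() - start

def pick_device(kernels, args, trials=3):
    """Toy sketch of benchmark-driven device selection: given a dict of
    device -> equivalent kernel implementation, time each one and
    return the device with the lowest observed latency."""
    best_device, best_time = None, float("inf")
    for device, kernel in kernels.items():
        # Take the minimum over a few trials to reduce timing noise.
        elapsed = min(_time_once(kernel, args) for _ in range(trials))
        if elapsed < best_time:
            best_device, best_time = device, elapsed
    return best_device

# Simulated kernels: sleeps stand in for real CPU/GPU work.
kernels = {
    "cpu": lambda n: time.sleep(0.02),   # stand-in "slow" implementation
    "gpu": lambda n: time.sleep(0.001),  # stand-in "fast" implementation
}
print(pick_device(kernels, (1000,)))  # -> gpu
```

In a real system the measured costs would feed back into the scheduler alongside data-transfer estimates, rather than being a one-shot decision per task.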