Dagger Development and Roadmap

Hey all! As Dagger.jl’s maintainer, I’d like to give an update on where Dagger stands this year and what the plans are for the rest of 2023!

Aside: I’ll be at JuliaCon this year, so come find me if you want to discuss anything about Dagger or parallelism in Julia! I’ll also be giving two Dagger talks, one at the Data minisymposium, and one at the HPC minisymposium, so come listen in to hear what Dagger can do for you!

Developments so far:

GSoC and the DArray

This year we’ve got a Google Summer of Code student @fda-tome working on Dagger’s DArray, with a focus on updating its internal implementation, adding MPI support, and implementing more linear algebra operations. He’s been very successful with this so far, with the DArray now in a much better place in terms of matching Julia’s AbstractArray interface, and with MPI support rapidly approaching feature parity with its Distributed support. Some of these changes are going to be a part of my HPC minisymposium talk, so come check it out!
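As a rough sketch of what working with the DArray looks like (the block sizes and exact constructor names here are illustrative and may differ between releases):

```julia
using Distributed
addprocs(2)  # add two worker processes
@everywhere using Dagger

# Partition a matrix into 100x100 blocks distributed across workers
A = rand(1000, 1000)
DA = Dagger.distribute(A, Blocks(100, 100))

# The DArray aims to match Julia's AbstractArray interface,
# so generic operations like reductions work as expected
sum(DA)
```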

Documentation overhaul

Dagger’s documentation has historically been pretty unapproachable for new users and for experienced users alike, sometimes having too few details in important areas, and too much detail in areas that few people are interested in. I’ve spent some time working on the docs to improve this situation, with a new “Quickstart” introduction to Dagger on the first page, and a reordering of some documentation to line up related details in the same sections. The docs could always use more love, so if anyone is willing to help out or just point out where the docs could be improved, please reach out!
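For a taste of what the Quickstart covers, the core task API is just a couple of calls:

```julia
using Dagger

# Spawn tasks; Dagger infers the dependency of `b` on `a`
# from its arguments and schedules both automatically
a = Dagger.@spawn 1 + 2
b = Dagger.@spawn a * 3

fetch(b)  # waits for completion and returns 9
```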

Improved GPU support

A few long-overdue changes have landed in DaggerGPU.jl lately, providing improved AMDGPU integration, Metal integration (thanks @Ronis_BR and Eric Hallahan!), and direct support for compiling and launching KernelAbstractions.jl kernels. Together with Dagger’s new spawn_sequential task queue for in-order kernel launch (details in my HPC talk), utilizing GPUs with Dagger has never been easier!
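As a rough sketch of how in-order launches might look (launch_kernel1! / launch_kernel2! and buf are hypothetical placeholders, and the exact spawn_sequential form may differ from what ships):

```julia
using Dagger

# Tasks spawned inside spawn_sequential are queued in submission
# order, which matters when GPU kernels must launch in-order
Dagger.spawn_sequential() do
    t1 = Dagger.@spawn launch_kernel1!(buf)  # hypothetical kernel launch
    t2 = Dagger.@spawn launch_kernel2!(buf)  # queued after t1
end
```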

Improved file/out-of-core support

I’ve been putting together a set of changes for MemPool.jl (Dagger’s storage and I/O dependency) which make it possible to utilize files as lazy inputs to Dagger tasks - a new set of APIs (Dagger.File and Dagger.tofile) will be landing in Dagger to make use of these new features, and will be wired into the DTable as well for easier and more efficient table ingest from files.
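A hypothetical sketch of how these APIs might be used (the names come from the post, but the exact signatures may differ once they land):

```julia
using Dagger

# Write a value out to a file, getting back a lazy file reference
f = Dagger.tofile([1, 2, 3], "data.jls")

# Or lazily reference an existing file on disk
g = Dagger.File("data.jls")

# Either can be passed as a task input; the data is only
# loaded on the worker that actually executes the task
t = Dagger.@spawn sum(g)
fetch(t)
```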

Thanks to changes by @krynju, it’s also become easier to configure out-of-core support via the new Dagger.enable_disk_caching! API, which makes it easy to configure out-of-core across multiple Julia processes; see the docstring for details!
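For example (assuming the zero-argument form enables sensible defaults; see the docstring for the actual configuration options):

```julia
using Distributed
addprocs(2)
using Dagger

# Enable disk caching (out-of-core storage) across
# all connected Julia processes in one call
Dagger.enable_disk_caching!()
```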

Website

Dagger has a new website: https://daggerjl.ai. We’ve got some information on what Dagger is and why you’d want to use it, as well as a blog (where this post will also be available). The source is available on GitHub at jpsamaroo/www.daggerjl.com, so please feel free to add any Dagger-related content, including any blog posts and benchmarks!

Logo

Dagger also has a new logo!

I figured this was long overdue, and I’m quite happy with the result!

Roadmap for 2023:

Dagger 0.18

The release of Dagger 0.18.0 will ideally happen sometime before Thursday the 27th to coincide with the two Dagger talks I’ll be giving! This release will include any of the above work that hasn’t yet made it into a release. I’m bumping the minor version as the DArray is breaking its API to better match Julia’s AbstractArray API, but otherwise there shouldn’t be any other breaking changes since 0.17.0.

Machine Learning

As machine learning and AI have become substantially more relevant and powerful over the last year, it’s high time for Julia’s support for these technologies to improve. To achieve this, distributed model training and inference should be available and easy to use; to this end, I’m planning to work with the Flux/Lux maintainers and ecosystem to pick up DaggerFlux.jl development and add support for Distributed Data Parallel (DDP) and other parallelism strategies. I’m also keen to see strong AMDGPU support in DaggerFlux, as well as support for other accelerators that have the requisite ML operators. Please reach out if you’re interested in helping!

Improved mutation support

Dagger has always pursued the functional programming approach, where tasks are generally expected to operate out-of-place and allocate results to ensure reliable behavior in the face of automatic multithreaded and distributed execution. However, out-of-place operations aren’t always feasible when working with large data, which is especially visible to users of the DTable and DArray (which don’t currently have support for mutating tables/arrays in-place). It’s my intention to add more formal mutable data support to Dagger, and in doing so, allow the DTable and DArray to add in-place operators as seen in DataFrames.jl and most AbstractArray implementations, respectively.

Distributed graphs

After presenting on Dagger in Toronto this year at WAW23, and listening in on what users there were working on, I’m interested in implementing a Dagger-powered distributed graph abstraction, which would allow graph theory research and graph operations (like Graph NNs) to operate across multiple processes and automatically benefit from multithreading and possibly GPU support. I’m not amazingly well-versed with the best way to implement something like this, so if someone more graph-oriented is willing to work with me or take the lead, that would be amazing!

Intel GPUs, GraphCore IPUs, and other accelerators

With the impending deployment of the Aurora supercomputer (sporting all Intel GPUs), and with support for GraphCore IPUs thanks to @giordano, I’m planning to add integrations for these and other accelerators in DaggerGPU soon! Intel GPU support should be particularly easy given that Julia already has a reasonably mature Intel array and KernelAbstractions backend, so if anyone wants to tackle this one, I’m happy to provide guidance!
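For context, Julia’s Intel GPU stack already provides a working array type that DaggerGPU could build on; a minimal sketch (assuming a machine with an Intel GPU and the oneAPI.jl package installed) looks like:

```julia
using oneAPI  # Julia's Intel GPU backend

# Allocate on the Intel GPU and compute via the array interface
a = oneArray(rand(Float32, 1024))
b = a .+ 1f0   # broadcast runs on the GPU
sum(b)         # reductions work too
```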

Conclusion

This has been an excellent year for Dagger’s development, with many amazing possibilities not far away. I hope that as an ecosystem, we can push Dagger to become the best parallel programming API for many use cases and ensure that its performance is as good as possible.

I’m also very interested in hearing what people are excited to use Dagger for, and what features they’re interested in having! Dagger needs community support to thrive, so I’d encourage people to ask questions, provide suggestions, and talk about how their experience with Dagger has been!

It has been a massive pleasure to work on Dagger.jl these past months. I want to thank the whole Julia community for being so welcoming, but especially @jpsamaroo and @evelyne-ringoot for being with me, and I hope that many more good things are yet to come 🙂. Proud to be part of all this.

Awesome! I just can’t wait to integrate Dagger into our satellite simulator using the unified memory on Apple’s M-series processors 🙂

I am always excited by Dagger’s progress, but I never manage to use it for anything. Most of my use cases involve mixing Dagger with Flux.

One example where I think Dagger might shine is spreading a large language model over multiple GPUs, because the computation over the full model doesn’t fit into a single GPU’s memory.

A MWE would look like:

using Dagger
using Flux

m₁ = Dense(2, 13)
m₂ = Dense(13, 17)
m₃ = Dense(17, 3)
c = Chain(m₁, m₂, m₃)

x = rand(Float32, 2)  # example input

o₁ = Dagger.@spawn m₁(x)
o₂ = Dagger.@spawn m₂(o₁)
o₃ = Dagger.@spawn m₃(o₂)
fetch(o₃)

How do I ensure that each sub-model ends up on a different GPU? Or on a different process? And how do I minimize data movement?

Thanks,
Tomas

Great question! There are a few possibilities:

  1. Use Dagger.@mutable in the hypothetical form m1 = Dagger.@mutable worker=2 Dense(2,13) to allocate the layer on worker 2; after that, passing m1 to Dagger.@spawn will ensure the computation occurs there. The hypothetical form (with worker=pid) doesn’t yet exist, but it’d be an easy PR (it just needs to do a remotecall_fetch on the worker).
  2. Teach the scheduler to do the idea in (1) automatically - more general and less work for users, but will require some thought about how this will look.
  3. Implement a streaming model of computation in Dagger, and implement (2) - this better matches what you’ll really want to be able to do, which is to stream training/inference data through the model chain continuously or for a large amount of data.
  4. Use the idea in (1), but also teach Dagger to move data from one process to another if it becomes profitable to do so. This would also benefit the streaming idea of (3), and is potentially on the roadmap for this year.
  5. Add some kind of “execute elsewhere” logic to Dagger’s scope system, which would allow a scope to be computed based on another scope (so the scope for m2 would specifically not include the scope for m1, and so on for m3). This has questionable utility in the long run, but is maybe an interesting idea to think on.

The fact that none of these currently exist in Dagger is somewhat unsatisfactory, but it also indicates that we have a lot of interesting possibilities (I can see options (1)-(4) all being worthwhile to implement, each with different tradeoffs).

EDIT: Idea (1) is now available via Dagger.@mutable (JuliaParallel/Dagger.jl PR #409, “Add remote execution like shard”), which will make it into 0.18!
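Putting idea (1) together with the MWE above, pinning each layer to a worker would look something like this (worker IDs and the input x are illustrative, and the exact @mutable syntax may differ slightly from the merged PR):

```julia
using Distributed
addprocs(2)
@everywhere using Dagger, Flux

# Allocate each layer on a specific worker; tasks that take that
# layer as an argument will then execute on that same worker
m₁ = Dagger.@mutable worker=2 Dense(2, 13)
m₂ = Dagger.@mutable worker=3 Dense(13, 17)

x = rand(Float32, 2)
o₁ = Dagger.@spawn m₁(x)   # runs on worker 2
o₂ = Dagger.@spawn m₂(o₁)  # runs on worker 3; o₁'s result moves there
fetch(o₂)
```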
