Who is using Dagger?

Hey all! I’m interested in finding out who in the community is using Dagger, or has used Dagger in the past. If so, I’d appreciate it if you could answer any of the following questions:

  • What kind of problem(s) are you solving with Dagger?
  • Are you seeing the performance you would hope for from Dagger?
  • At what scale (laptop, server, cluster, supercomputer) do you currently run, or plan to run, your problem, with or without Dagger?
  • Do you use any unique Dagger features that you find really useful?
  • Do you use, or plan to use, Dagger’s GPU compute capabilities?
  • What is your highest priority need from Dagger (overall speed, composability, special features, table support, etc.)?

If you feel Dagger is worthy of it, I’d love for you to provide a testimonial about Dagger and its potential as a computing platform!

I’m looking forward to your responses!

12 Likes


I have tried JuliaDB on and off several times but never got it working. I thought JuliaDB and Dagger were no longer maintained.


I picked up Dagger maintenance about 2 years ago, when I joined the JuliaLab; prior to that, it was basically unmaintained, except for minor fixes for JuliaDB. Today, Dagger is very actively maintained by @krynju and myself.

JuliaDB, on the other hand, is still struggling to stay maintained, although people like @quinnj have done a lot of work to get it back up to date with the ecosystem and keep it functioning. But I would be hard-pressed to say that JuliaDB is actively maintained right now.

4 Likes

Not being able to work with out-of-memory data is a huge deal-breaker in data science. I still don’t understand why JuliaDB gets such limited attention.

I haven’t used JuliaDB in a while (used to use Dagger via JuliaDB), but it’s nice to see that Dagger is actively maintained! Even though I’m not an active user / developer, here are my two cents.

I imagine that map, filter, reduce, and groupby already cover a fair amount of the use cases of distributed data processing, especially since reduce also works on grouped tables.
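
As a concrete sketch of what I mean (written from memory of the DTable API at the time, so names and signatures may be slightly off):

```julia
using Dagger

# Build a distributed table from a NamedTuple of columns, in chunks of 25 rows
d = DTable((a = collect(1:100), b = rand(100)), 25)

# map and filter operate row-wise and return new DTables, evaluated per chunk
m = map(row -> (c = row.a + row.b,), d)
f = filter(row -> row.a > 50, d)

# reduce folds an operator over each column across all chunks
s = reduce(+, d; init = 0)
fetch(s)  # a NamedTuple of per-column sums
```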

OTOH, and this could be typical of Julia, a lot of features work out of the box, and it may just be a matter of documenting that they do. I suspect that writing docs for things that just work by composability could be a simple way to “add more features for free”.

For example, I tried the following and it worked:

julia> using Dagger, OnlineStats

julia> d = DTable((a = rand(100), b = rand(100)), 50);

julia> m = reduce(fit!, d, init=Variance());

julia> fetch(m)
(a = Variance: n=100 | value=0.0898072, b = Variance: n=100 | value=0.0796099)

So you can already compute summary statistics in a distributed way with one pass over the data, which is really nice but also hard to guess from the docs. I suspect this would also work with grouped data to compute grouped summary statistics.

I think Dagger could benefit from more “docs for end users” (as in tutorials and how-to guides in the Divio system), and from clearer signaling of which docs are more “beginner-friendly” (in the current version, I’d say it’s mostly this section).

As a practical suggestion, besides the features you get from composability, things that IMO could be added to the docs are

  • a typical data-wrangling tutorial done with Dagger (that’s also a great way to see whether features are missing); I’m familiar with this one, but there are certainly many other options out there
  • a nice simple section for DArray that parallels the one for DTable, ensuring it has the same ease of use. For example, it was surprising to see that DTable((a = rand(100), b = rand(100)), 50) works but DArray(rand(100), 50) does not.
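
For reference, I believe the working spelling on the array side currently goes through Dagger.distribute and Blocks rather than a direct constructor (from memory, so the names may differ across Dagger versions):

```julia
using Dagger

A = rand(100)
# Partition the vector into two blocks of 50 elements each
dA = Dagger.distribute(A, Dagger.Blocks(50))

collect(dA)  # gathers the distributed array back into a plain Vector
```

It would be nice if DArray grew a constructor that mirrors the DTable one.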

Hope this helps, and kudos for all the hard work on Dagger, it’s really coming along very nicely!

6 Likes

I agree with more docs for end users.

I would suggest documentation similar to Ray’s (Dagger seems to be somewhere between Dask and Ray): What is Ray? — Ray v1.8.0

Specifically, I think these would be helpful:

  • instead of “usage” being the first page, I’d suggest a “gentle introduction” or “tutorial”
  • a best-practices page covering patterns and antipatterns
  • a guide on how to interoperate with other Julia packages and ecosystems

4 Likes

I’m a happy Dagger user, although I mostly use it through the excellent FileTrees for quite simple and boring tasks compared to what it is designed for.

I like that I can use the convenience of FileTrees regardless of problem size, and I use the first three scales you mentioned (laptop, server, and cluster).

I do often get disappointed with the “low scale” performance, e.g. trying to use more cores on a laptop. This is in the context of quick one-off exploratory work, e.g. I have a bunch of files I want to turn into plots using some transform; let’s see what happens if I use more threads/processes.

I don’t think this can be fully blamed on Dagger, as it is probably a mixture of other things, e.g. easily becoming RAM-limited. Rather, I see it as an opportunity, and it shows the value of the work you are doing in Dagger to optimize the scheduling. Making the computation graph easy to view might also help users spot places where they accidentally hamper parallelism.

When it comes to cluster scale, it is often a bumpy road to get things working, and it can be a bit janky for reasons I don’t think one can blame on Dagger. One common failure mode for me is a worker that, for one reason or another, becomes unreachable and brings the whole thing down. I think this is something the fault handler should ideally catch, but it seems like the exception often slips through.

As for unique features, I like the ability to add more processors over time, as this happens to fit very well with the nature of the above type of work and the unpredictability of how the cluster scheduler gives me resources.
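
To illustrate, the pattern looks roughly like this (a sketch; exactly how new workers get picked up depends on the Dagger version):

```julia
using Distributed, Dagger

# Start work on whatever workers currently exist
tasks = [Dagger.@spawn sum(rand(10^6)) for _ in 1:16]

# Workers added mid-run can be picked up by the scheduler for remaining tasks,
# e.g. as the cluster scheduler grants me more resources
addprocs(2)

fetch.(tasks)
```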

I’m interested in the capabilities of DTable and just generally keeping partial results on multiple workers but I haven’t started to make use of it yet for external reasons.

3 Likes

I guess I count as a pretty heavy Dagger user, since I based the DTable implementation fully on it. To me, Dagger does the heavy lifting of parallelization and memory management that projects like the DTable require. Without it, we’d be reinventing the wheel every time someone wants to develop more advanced threaded/distributed functionality that isn’t just a parallel loop.

I think it shows, from the development of the DTable and the issues appearing along the way, that we need more care put not only into Dagger, but into Distributed as well. It may not be the most popular area of the Julia ecosystem, but it’s very important, as single-machine performance can only get us so far.

I have been using Dagger on a single machine in threaded or mixed setups during the development of the DTable. I like to think that anywhere Distributed works, Dagger works as well, but I haven’t tried many other configurations (I’m interested in how it works together with Kubernetes). That would probably be my favourite feature: it just works and scales across different environments without much hassle.

My top requested features would probably be:

  • to-disk caching of managed memory
  • better interface for adding tasks and data into the scheduler (e.g. batch scheduling)
  • worker/thread pools and better control over scheduling (e.g. hints at spawn time: leaf task, task returning big or small data chunks, etc.)

Overall I think Dagger (and related) needs more work and care on both code and documentation levels. It already is a great base for some packages and it will most likely do the heavy lifting for future packages/projects like the DTable.

Dagger also needs to be promoted better to reach the potential user base more effectively, which is already small due to threaded/distributed computing just being a pretty niche area of interest.

And we need a nice logo

6 Likes