Obviously there are no upper-layer applications.
As a user, it can’t be used to solve real problems.
My highest-priority needs are DataFrames.jl on Dagger, OnlineStats.jl on Dagger, and Transducers.jl on Dagger, to achieve Spark’s or Flink’s functionality…
First, this JuliaHub tool isn’t a great metric for determining whether Dagger has users: JuliaFolds/FoldsDagger.jl, JuliaParallel/DaggerArrays.jl, and JuliaGPU/DaggerGPU.jl all depend on Dagger, yet they don’t show up under “Direct Dependents”, and at a minimum DaggerGPU is a registered package. Obviously this listing is inaccurate and needs fixing. (Edited above to note that this is a JuliaHub tool, not a GitHub tool, and that DaggerGPU is the only registered package of the three.)
Second, just because DataFrames, OnlineStats, etc. don’t directly depend on Dagger, that doesn’t mean that Dagger can’t be used to solve real problems (which is a rather rude statement to make, but I’ll ignore that).
Dagger’s DTable, for example, wraps other tables and distributes them, so the ecosystem doesn’t need to depend on Dagger directly; instead, users can back their distributed table with any table type they want. You can then use OnlineStats to process that distributed table, computing statistics and performing aggregations, because the DTable is composable. No Dagger dependency in OnlineStats needed!
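Here’s roughly what that composition looks like; a minimal sketch, assuming the DTable API as documented in recent Dagger/DTables releases, with a random column `a` as placeholder data:

```julia
using Dagger, OnlineStats

# Wrap a plain NamedTuple-of-vectors table (any Tables.jl-compatible
# source works) and split it into partitions of 10 rows each.
dt = Dagger.DTable((a = rand(100),), 10)

# OnlineStats knows nothing about Dagger, yet fit!/Mean compose with
# the distributed reduce over the table's partitions.
r = reduce(fit!, dt, init = Mean())
fetch(r)  # => (a = Mean: n=100 | value≈0.5,)
```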
Similarly, Dagger.@spawn can wrap basically any function, and doesn’t require the function to know anything about Dagger to work; this is the same as how Threads.@spawn doesn’t always require the spawned function to know that it’s running on a different thread. The composable nature of Julia allows both of these spawning mechanisms to “just work”.
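To make that concrete, a tiny sketch; nothing here is Dagger-specific except the macro itself:

```julia
using Dagger

# An ordinary function with no knowledge of Dagger or threads.
mysum(x) = sum(abs2, x)

# Dagger schedules it as a task across workers/threads...
t1 = Dagger.@spawn mysum(rand(1_000))
fetch(t1)

# ...and Threads.@spawn runs the very same function on another
# thread; the function is identical in both cases.
t2 = Threads.@spawn mysum(rand(1_000))
fetch(t2)
```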
It’s true that Dagger does need to see further ecosystem uptake to become a better composable foundation for distributed computing, but it’s still possible to use it today to solve problems. My original post, which this thread was split from, was intended to query the community to find out how that’s currently working out, which will help inform Dagger’s development.
That’s not a GitHub screenshot, it’s a JuliaHub one. It only tracks other registered packages, which is a small sliver of usages. Indeed, the package stats show it’s been downloaded 5000 times by 2000 unique IPs since Sept 1.
@zsz00, your initial comment here is far harsher than necessary; there’s no need to be so confrontational — especially when the main developer is asking how to make the package better for its current users.
I guess it’s strange that DaggerGPU does not show up on the dependents page though? OTOH, Dagger does show up on the dependencies page of DaggerGPU https://juliahub.com/ui/Packages/DaggerGPU/5ydcV/0.1.2?page=1. Maybe it’s trimmed off by the heuristics on JuliaHub or something?
Sorry, updated the text to point out that this is a JuliaHub tool! Also pointed out @tkf 's point that DaggerGPU should show up.
Hello.
I like your answer a lot, in particular because finally, after a lot of research, I got a hint on how to use DTable.
I would say that, as a beginner, the main problem I have found in using Dagger, and in particular DTable, is the absence of more or less complete examples that I can use to guide my own investigations.
Unfortunately, the architecture of Julia, from my perspective, makes it a lot more complicated to start walking: there are simply too many ways of doing the same thing, and the documentation tends to focus on each package individually rather than on the integration between packages. That does not help my productivity (as a beginner), so, in my opinion, having some curated examples where that integration is highlighted would help me a lot in understanding how things work together. I am sure that once I get the knack of the language I will fall in love with it, as some of my colleagues truly have…but so far the learning curve is not easy.
Now…coming back to the use cases for Dagger, I would say that parallelizing the analysis of large collections of electron microscope images is a great use case. Typical datasets are about 60 to 100 GB in memory, but there is a natural unit of aggregation (let’s say one photo) that is typically about 50 MB to 100 MB. Ideally one would have a workflow that loads the chunks independently, processes them, and then writes the results back to another file, as sketched below. Actually, what I am trying to do at the moment is related to that: I am trying to connect Apache Arrow in streaming mode to DTable, to let me explore little toy models around that problem, a bit inspired by how Rust’s Polars works…but, as I mentioned above, the process is a little bit like a baby’s crawl, mostly because, in my opinion, the documentation is not really targeting my (beginner) user group.
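To illustrate the kind of workflow I have in mind, here is a rough sketch; `load_image`, `analyze`, and `save_result` are hypothetical placeholders for whatever I/O and processing the real pipeline would use:

```julia
using Dagger

# Hypothetical helpers standing in for real I/O and processing:
#   load_image(path)      -> in-memory image (~50-100 MB)
#   analyze(img)          -> per-image result
#   save_result(path, r)  -> write the result to another file

# Wrap loading + analysis so both run inside the Dagger task,
# rather than loading eagerly on the caller.
process_one(path) = analyze(load_image(path))

function process_dataset(paths)
    # One Dagger task per image; tasks run in parallel across workers.
    tasks = [Dagger.@spawn process_one(p) for p in paths]
    # Write each result back as it becomes available.
    for (p, t) in zip(paths, tasks)
        save_result(p * ".result", fetch(t))
    end
end
```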
Sincerely
Javier