I haven’t used JuliaDB in a while (used to use Dagger via JuliaDB), but it’s nice to see that Dagger is actively maintained! Even though I’m not an active user / developer, here are my two cents.
I imagine that `groupby` already covers a fair number of use cases in distributed data processing, especially since `reduce` also works on grouped tables.
OTOH, and this could be typical of Julia, a lot of features work out of the box, and it may just be a matter of documenting that they do. I suspect that writing docs for things that just work by composability could be a simple way to “add more features for free”.
For example, I tried the following and it worked:
julia> using Dagger, OnlineStats
julia> d = DTable((a = rand(100), b = rand(100)), 50);
julia> m = reduce(fit!, d, init=Variance());
julia> fetch(m)
(a = Variance: n=100 | value=0.0898072, b = Variance: n=100 | value=0.0796099)
So you can already compute summary statistics in a distributed way with one pass over the data, which is really nice but also hard to guess from the docs. I suspect this would also work with grouped data to compute grouped summary statistics.
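To make the "one pass, distributed" idea concrete, here is a minimal sketch using only OnlineStats, with two plain vectors standing in for table partitions. I have not checked how Dagger combines per-chunk results internally, so treat this as an illustration of the general fit-then-merge pattern rather than a description of what `reduce` on a `DTable` does under the hood. The names `chunks`, `partials`, and `combined` are just for illustration.

```julia
using OnlineStats, Statistics

x = rand(100)
chunks = [x[1:50], x[51:100]]   # stand-ins for two partitions of a column

# One pass over each partition, independently of the others.
# This is the part that could run on separate workers.
partials = [fit!(Variance(), chunk) for chunk in chunks]

# Combine the partial results: merge! folds the second stat into the first.
combined = reduce(merge!, partials)

nobs(combined)    # number of observations seen across both partitions
value(combined)   # should agree with var(x) up to floating-point error
```

Because each partial statistic only stores a few numbers (count, mean, variance), merging is cheap no matter how large the partitions are.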
I think Dagger could benefit from more “docs for end users” (as in the tutorials and how-to guides of the Divio documentation system), and from clearer signaling of which docs are more “beginner-friendly” (in the current version, I’d say it’s mostly this section).
As a practical suggestion, other than the features you get from composability, things that IMO could be added to the docs are:
- a typical data-wrangling tutorial done with Dagger (that’s also a great way to see if features are missing). I’m familiar with this one, but there are certainly many other options out there
- a nice simple section for `DArray` that parallels the one for `DTable`, ensuring it has the same ease of use. For example, it was surprising to see that `DTable((a = rand(100), b = rand(100)), 50)` works but `DArray(rand(100), 50)` does not.
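For comparison, this is roughly how I understand a `DArray` is built today. It is a sketch that assumes a Dagger version providing `Dagger.distribute` and `Dagger.Blocks`, and the exact entry points may differ between releases.

```julia
using Dagger

x = rand(100)

# Split the vector into blocks of 50 elements each (assumed API: distribute + Blocks).
dx = Dagger.distribute(x, Dagger.Blocks(50))

# Gather the blocks back into an ordinary Array to check the round trip.
collect(dx) == x
```

A constructor along the lines of the `DTable` one would make this much easier to discover.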
Hope this helps, and kudos for all the hard work on Dagger. It’s really coming along very nicely!