
Regarding DTables.jl: it hasn't seen much focused development, so advanced features like query optimization haven't been a priority for the development time that has gone into it. But that doesn't mean we couldn't make them work with a bit of restructuring. I'm also interested in various kinds of DAG-to-DAG optimizations for Dagger, so maybe those efforts could tie in together?

I think what's really missing from DTables is better support for the nice APIs that DataFrames has. If users could drop a DTable into code that already works with DataFrames, that would make their lives much easier, and it would make maintaining DTables-enabled code not significantly more work than maintaining DataFrames-enabled code.
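For a sense of what that means in practice, here's a rough sketch using the map/filter/reduce interface that DTables currently documents (the chunk size and column names are just placeholders); the last comment marks the DataFrames-style usage that is still the missing piece:

```julia
using DTables, DataFrames

# Build a DTable from an existing DataFrame, partitioned into chunks
# that Dagger can schedule across threads/workers.
df = DataFrame(a = rand(1:10, 1_000), b = rand(1_000))
dt = DTable(df, 250)  # ~250 rows per chunk

# Today's API: lazy, row-oriented map/filter/reduce over the partitions.
filtered = filter(r -> r.b > 0.5, dt)
total    = reduce(+, filtered)   # returns a Dagger task
fetch(total)                     # materialize the reduction result

# The missing piece: handing `dt` straight to code written against
# DataFrames' select/transform/groupby/combine API, unchanged.
```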

Regarding in-memory vs. out-of-core vs. distributed: you really don't want a sharp transition when going from one to another, because otherwise you have to maintain API surfaces targeting all three separately, and users have to learn the different APIs and their peculiarities (and once that gets cumbersome, they simply won't spend the effort). This is why I've developed Dagger's own APIs to not be specific to distributed or multi-threaded execution - you just use the simple API, and Dagger does the rest. The same applies to out-of-core, in a sense: Dagger can seamlessly swap data to disk, or use data lazily loaded from disk as an input, and this should work with tables just as well as it works for arrays, without any changes to your analyses or algorithms.
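As a minimal illustration of that uniform API (just `Dagger.@spawn` and `fetch`; nothing here is specific to threads vs. Distributed workers, and any swapping to disk happens underneath the same calls):

```julia
using Dagger

# Each @spawn creates a task in Dagger's DAG; tasks can depend on other
# tasks simply by taking them as arguments. Whether these run on local
# threads or on Distributed workers is the scheduler's business, not
# the code's.
a = Dagger.@spawn rand(1_000, 1_000)
b = Dagger.@spawn rand(1_000, 1_000)
c = Dagger.@spawn a * b        # runs once `a` and `b` are ready
s = Dagger.@spawn sum(c)

fetch(s)  # block and retrieve the final result
```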
