
Regarding DTables.jl: it hasn't seen much focused development, so advanced features like query optimization haven't been a priority for the development time that has gone into it. But that doesn't mean we couldn't make them work with a bit of restructuring. I'm also interested in various kinds of DAG-to-DAG optimizations for Dagger, so maybe those efforts could tie in together?

I think what's really missing from DTables is better support for the nice APIs that DataFrames has. If users could drop a DTable into code that already works with DataFrames, that would make their lives much easier, and it would make maintaining DTables-enabled code not significantly more work than maintaining DataFrames-enabled code.
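For a sense of what that means in practice, here's a rough sketch using the map/filter/reduce interface that DTables currently documents (the chunk size and column names are just placeholders); the last comment marks the DataFrames-style usage that is still the missing piece:

```julia
using DTables, DataFrames

# Build a DTable from an existing DataFrame, partitioned into chunks
# that Dagger can schedule across threads/workers.
df = DataFrame(a = rand(1:10, 1_000), b = rand(1_000))
dt = DTable(df, 250)  # ~250 rows per chunk

# Today's API: lazy, row-oriented map/filter/reduce over the partitions.
filtered = filter(r -> r.b > 0.5, dt)
total    = reduce(+, filtered)   # returns a Dagger task
fetch(total)                     # materialize the reduction result

# The missing piece: handing `dt` straight to code written against
# DataFrames' select/transform/groupby/combine API, unchanged.
```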

Regarding in-memory vs. out-of-core vs. distributed: you really don't want a sharp transition when going from one to another, because otherwise you have to maintain API surfaces targeting all three separately, and users have to learn the different APIs and their peculiarities (and once that gets cumbersome, they simply won't spend the effort). This is why I've developed Dagger's own APIs to not be specific to distributed or multi-threaded execution - you just use the simple API, and Dagger does the rest. The same applies to out-of-core, in a sense: Dagger can seamlessly swap data to disk, or use data lazily loaded from disk as an input, and this should work with tables just as well as it works for arrays, without any changes to your analyses or algorithms.
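As a minimal illustration of that uniform API (just `Dagger.@spawn` and `fetch`; nothing here is specific to threads vs. Distributed workers, and any swapping to disk happens underneath the same calls):

```julia
using Dagger

# Each @spawn creates a task in Dagger's DAG; tasks can depend on other
# tasks simply by taking them as arguments. Whether these run on local
# threads or on Distributed workers is the scheduler's business, not
# the code's.
a = Dagger.@spawn rand(1_000, 1_000)
b = Dagger.@spawn rand(1_000, 1_000)
c = Dagger.@spawn a * b        # runs once `a` and `b` are ready
s = Dagger.@spawn sum(c)

fetch(s)  # block and retrieve the final result
```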
