A serious data start-up structured around a Julia data manipulation framework for larger-than-RAM data

Regarding DTables.jl: it hasn’t seen much focused development, so advanced features like query optimization haven’t been prioritized. But that doesn’t mean we couldn’t make them work with a bit of restructuring. I’m also interested in various kinds of DAG-to-DAG optimizations for Dagger, so maybe those efforts could tie in together?

I think what’s really missing from DTables is better support for the nice APIs that DataFrames has. If users could drop a DTable into code that already works with DataFrames, that would make their lives much easier, and it would make maintaining DTables-enabled code not significantly more work than maintaining DataFrames-enabled code.
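For contrast, here’s a sketch of the functional interface DTables already has (based on the examples in the DTables.jl README; the column names are made up):

```julia
using DTables

# A DTable wraps any Tables.jl-compatible source, split into chunks
# (here: 2 rows per chunk) that Dagger can process in parallel.
table = (a = [1, 2, 3, 4], b = [10, 20, 30, 40])
dt = DTable(table, 2)

# Operations are lazy; `fetch` materializes the result.
squared = map(row -> (c = row.a ^ 2,), dt)
fetch(squared)

big_b = filter(row -> row.b > 15, dt)
fetch(big_b)
```

This works, but it isn’t the `select`/`transform`/`combine` style that DataFrames users already know.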

Regarding in-memory vs. out-of-core vs. distributed: you really don’t want a sharp transition when going from one to another. Otherwise you have to maintain API surface targeting all three separately, and users have to learn the different APIs and their peculiarities (and once that gets cumbersome, they simply won’t spend the effort). This is why I’ve designed Dagger’s own APIs not to be specific to distributed or multi-threaded execution: you just use the simple API, and Dagger does the rest. The same applies to out-of-core, in a sense: Dagger can seamlessly swap data to disk, or use data lazily loaded from disk as an input, and this should work just as well with tables as it does with arrays, without any changes to your analyses or algorithms.
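To make that concrete, a minimal sketch of Dagger’s execution-agnostic style (standard `Dagger.@spawn` usage; nothing here is specific to threads or workers):

```julia
using Dagger

# The same code runs multithreaded, multiprocess, or both;
# Dagger's scheduler decides where each task executes.
a = Dagger.@spawn rand(1000, 1000)
b = Dagger.@spawn rand(1000, 1000)
c = Dagger.@spawn a * b   # depends on a and b, runs once they're ready
fetch(c)                  # materialize the result
```

Whether the inputs live in memory on this process, on another worker, or swapped out to disk is the scheduler’s concern, not the user’s.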

Also, an alternative approach: you can just use DataFrames with DArray columns, which would get you many of the advantages of using Dagger without dealing with the current issues in DTables. DArrays see the most development, since they’re the most general data structure and suit the widest range of use cases (and being built into Dagger makes them less burdensome to maintain). You can expect to see their capabilities and performance increase in the near future and beyond.
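An untested sketch of what that could look like (assuming a Dagger and DataFrames version where a `DVector` behaves well as a column; `copycols = false` avoids eagerly copying it):

```julia
using Dagger, DataFrames

# Distribute a plain vector into a DVector with 4 partitions.
x = distribute(rand(1_000), Blocks(250))

# A DVector is an AbstractVector, so it can back a DataFrame column.
df = DataFrame(x = x, copycols = false)

# Broadcasting over the column stays a Dagger operation.
df.y = x .+ 1
```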

See also [ANN,RFC] DBCollections.jl – use Julia data manipulation functions for databases for a somewhat complementary direction. DBCollections allows using regular Julia operations on SQL databases, which is nice for big / out-of-memory / remote data.
Should be fully composable!
Some features, such as joins or arbitrary Julia UDFs, are possible but not implemented yet (it’s just ~200 LOC).
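A rough sketch of the flavor (the constructor spelling here is illustrative; see the package README for the exact API):

```julia
using SQLite, DBCollections

# Wrap a database table so it behaves like a Julia collection
# (the `DB(db, :mytable)` spelling is an assumption, not gospel).
db = SQLite.DB("mydata.sqlite")
tbl = DB(db, :mytable)

# Regular Julia operations are translated to SQL and run in-database:
filter(r -> r.value > 10, tbl)
map(r -> (; r.id, doubled = 2 * r.value), tbl)
```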

It would be nice to understand and list the specific limitations of the SQL-database approach compared to querying pure-Julia structures, as @xiaodai plans to do. Some of them may well turn out to be realistically surmountable!

In DuckDB, it is possible to reuse SQL with macros, and it is also possible to define table functions in Julia using create_table_function.
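For example, a minimal sketch of reusing SQL through a macro from the Julia client (standard DuckDB.jl `DBInterface` usage; the macro itself is plain DuckDB SQL):

```julia
using DuckDB, DBInterface, DataFrames

con = DBInterface.connect(DuckDB.DB, ":memory:")

# CREATE MACRO defines a reusable SQL expression.
DBInterface.execute(con, "CREATE MACRO add_tax(price, rate) AS price * (1 + rate)")

# The macro can now be reused in any query.
DBInterface.execute(con, "SELECT add_tax(100.0, 0.2) AS total") |> DataFrame
```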

Scalar UDFs written in Julia can also be used with DuckDB (via @cfunction), though I’m not sure the scalar UDF C API is covered in the documented client API. For example, the collatz example from the rhai extension can be implemented as a Julia function, and that function can then be registered in DuckDB.
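The `@cfunction` half of this is plain Julia. A minimal sketch of the collatz step counter as a C-callable function pointer follows; the DuckDB side (registering it through the scalar-function C API) is more involved and version-dependent, so it isn’t shown here:

```julia
# Collatz step count as an ordinary Julia function over C-compatible types.
function collatz(n::Cint)::Cint
    steps = Cint(0)
    while n != 1
        n = iseven(n) ? n ÷ Cint(2) : Cint(3) * n + Cint(1)
        steps += Cint(1)
    end
    return steps
end

# @cfunction produces a C-callable function pointer, which is what
# a C API (like DuckDB's) expects to receive.
const collatz_ptr = @cfunction(collatz, Cint, (Cint,))
```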

Are there any examples you’re aware of showing how to use @cfunction to register custom functions written in Julia? So far I am coming up short and unable to register a function.