A serious data start-up structured around a Julia data manipulation framework for larger-than-RAM data

Regarding DTables.jl: it hasn’t seen much focused development, so advanced features like query optimization haven’t been prioritized. But that doesn’t mean we couldn’t make them work with a bit of restructuring. I’m also interested in various kinds of DAG-to-DAG optimizations for Dagger, so maybe those efforts could tie in together?

I think what’s really missing from DTables is better support for the nice APIs that DataFrames has. If users could drop a DTable into code that already works with DataFrames, that would make their lives much easier, and it would make maintaining DTables-enabled code not significantly more work than maintaining DataFrames-enabled code.
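For contrast, here’s a sketch of the functional interface DTables already has (based on the examples in the DTables.jl README; the column names are made up):

```julia
using DTables

# A DTable wraps any Tables.jl-compatible source, split into chunks
# (here: 2 rows per chunk) that Dagger can process in parallel.
table = (a = [1, 2, 3, 4], b = [10, 20, 30, 40])
dt = DTable(table, 2)

# Operations are lazy; `fetch` materializes the result.
squared = map(row -> (c = row.a ^ 2,), dt)
fetch(squared)

big_b = filter(row -> row.b > 15, dt)
fetch(big_b)
```

This works, but it isn’t the `select`/`transform`/`combine` style that DataFrames users already know.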

Regarding in-memory vs. out-of-core vs. distributed: you really don’t want a sharp transition when going from one to another. Otherwise you have to maintain API surface targeting all three separately, and users have to learn the different APIs and their peculiarities (and once that gets cumbersome, they simply won’t spend the effort). This is why I’ve designed Dagger’s own APIs not to be specific to distributed or multi-threaded execution: you just use the simple API, and Dagger does the rest. The same applies to out-of-core, in a sense: Dagger can seamlessly swap data to disk, or use data lazily loaded from disk as an input, and this should work just as well with tables as it does with arrays, without any changes to your analyses or algorithms.
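To make that concrete, a minimal sketch of Dagger’s execution-agnostic style (standard `Dagger.@spawn` usage; nothing here is specific to threads or workers):

```julia
using Dagger

# The same code runs multithreaded, multiprocess, or both;
# Dagger's scheduler decides where each task executes.
a = Dagger.@spawn rand(1000, 1000)
b = Dagger.@spawn rand(1000, 1000)
c = Dagger.@spawn a * b   # depends on a and b, runs once they're ready
fetch(c)                  # materialize the result
```

Whether the inputs live in memory on this process, on another worker, or swapped out to disk is the scheduler’s concern, not the user’s.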

Also, an alternative approach: you can just use DataFrames with DArray columns, which would get you many of the advantages of using Dagger without dealing with the current issues in DTables. DArrays see the most development, since they’re the most general data structure and suit the widest range of use cases (and being built into Dagger makes them less burdensome to maintain). You can expect to see their capabilities and performance increase in the near future and beyond.
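An untested sketch of what that could look like (assuming a Dagger and DataFrames version where a `DVector` behaves well as a column; `copycols = false` avoids eagerly copying it):

```julia
using Dagger, DataFrames

# Distribute a plain vector into a DVector with 4 partitions.
x = distribute(rand(1_000), Blocks(250))

# A DVector is an AbstractVector, so it can back a DataFrame column.
df = DataFrame(x = x, copycols = false)

# Broadcasting over the column stays a Dagger operation.
df.y = x .+ 1
```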

See also [ANN,RFC] DBCollections.jl – use Julia data manipulation functions for databases for a somewhat complementary direction. DBCollections allows using regular Julia operations on SQL databases, which is nice for big / out-of-memory / remote data.
Should be fully composable!
Some features, such as joins or arbitrary Julia UDFs, are possible but not implemented yet (it’s just ~200 LOC).
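A rough sketch of the flavor (the constructor spelling here is illustrative; see the package README for the exact API):

```julia
using SQLite, DBCollections

# Wrap a database table so it behaves like a Julia collection
# (the `DB(db, :mytable)` spelling is an assumption, not gospel).
db = SQLite.DB("mydata.sqlite")
tbl = DB(db, :mytable)

# Regular Julia operations are translated to SQL and run in-database:
filter(r -> r.value > 10, tbl)
map(r -> (; r.id, doubled = 2 * r.value), tbl)
```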

It would be nice to understand and list the specific limitations of the SQL-database approach compared to querying pure-Julia structures, as @xiaodai plans to do. Some of them may well turn out to be realistically surmountable!

In DuckDB, it is possible to reuse SQL with macros, and it is also possible to define table functions in Julia using create_table_function.
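For example, a minimal sketch of reusing SQL through a macro from the Julia client (standard DuckDB.jl `DBInterface` usage; the macro itself is plain DuckDB SQL):

```julia
using DuckDB, DBInterface, DataFrames

con = DBInterface.connect(DuckDB.DB, ":memory:")

# CREATE MACRO defines a reusable SQL expression.
DBInterface.execute(con, "CREATE MACRO add_tax(price, rate) AS price * (1 + rate)")

# The macro can now be reused in any query.
DBInterface.execute(con, "SELECT add_tax(100.0, 0.2) AS total") |> DataFrame
```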

Scalar UDFs written in Julia can also be used with DuckDB (via @cfunction), though I’m not sure the scalar UDF C API is covered in the documented client API. For example, the collatz example from the rhai extension can be implemented as a Julia function, and that function can then be registered in DuckDB.
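The `@cfunction` half of this is plain Julia. A minimal sketch of the collatz step counter as a C-callable function pointer follows; the DuckDB side (registering it through the scalar-function C API) is more involved and version-dependent, so it isn’t shown here:

```julia
# Collatz step count as an ordinary Julia function over C-compatible types.
function collatz(n::Cint)::Cint
    steps = Cint(0)
    while n != 1
        n = iseven(n) ? n ÷ Cint(2) : Cint(3) * n + Cint(1)
        steps += Cint(1)
    end
    return steps
end

# @cfunction produces a C-callable function pointer, which is what
# a C API (like DuckDB's) expects to receive.
const collatz_ptr = @cfunction(collatz, Cint, (Cint,))
```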

Are there any examples you’re aware of showing how to use @cfunction to register custom functions written in Julia? So far I am coming up short and unable to register a function.