[ANN,RFC] DBCollections.jl – use Julia data manipulation functions for databases

aplavin · September 10, 2024, 9:57am

Imagine if you could use the exact same code to manipulate regular Julia collections and SQL tables… DBCollections.jl is the package to achieve that!

RFC: comments are welcome! Also, the naming is preliminary, feel free to suggest alternatives (:

There is no shortage of Julia packages that provide convenient interface to querying databases – see a comparison in the README.
DBCollections.jl is unique in that it allows you reuse the same code, the same functions without any modification for both in-memory collections (arrays, tables) and databases.
It turned out to be quite simple to implement, with just ~200 LOC without any macro definitions!

A couple of simple examples:

using DataManipulation, DBCollections

# data can either be a regular Julia collection or a DBCollection
@p let
    data
    filter(@o _.height > 180 && _.weight < 80)
    map(@o (name=_.name, ratio=_.weight / _.height))
    collect  # optional for Julia collections, required for DBCollections – returns a StructArray for them
end

@p let
    data
    map(@o (i=_.i, b=round(_.i/4.5)))
    group_vg(@o (a=_.i / 3.5,))
    map(@o (a=key(_).a, avg=mean(_.b), cnt=length(_)))
    collect
end

DBCollections.jl uses FunSQL.jl under the hood, and works with any database supported by it (SQLite, DuckDB, Postgres, MySQL, …).
A large set of operations is already supported by DBCollections:

map/filter/sort/Iterators.drop/first/…
group (from FlexiGroups.jl)
push!/append!/copy!

Coming soon:

join (from FlexiJoins.jl)
nested structures support, translated to JSON operations in SQL

gdalle · September 10, 2024, 11:15am

Can you comment on the link with the more generic tables interface?

mrufsvold · September 10, 2024, 11:34am

How does this compare to the TidierDB.jl implementation?

aplavin · September 10, 2024, 12:11pm

Sure, happy to comment – but not sure what’s the question, and “more generic” than what exactly? (:
DBCollections mimics the regular Julia iterator + collection interface, converting it to SQL. This interface is also supported by many in-memory Julia tables, fully or partially.
But this question is unrelated to DBCollections I think… Here, I don’t design any new interface, just enable using it for SQL databases.

aplavin · September 10, 2024, 12:19pm

I’ve looked into it a bit, not too familiar though. Afaict:

Tidierdb doesn’t allow using the exact same code for Julia datasets and for SQL. While DBCollections lets you write a single function/pipeline handling both.
Tests ensure that the result remains the same.
DBCollections follows the regular Julia interface, like map/filter supported by many collections, in contrast to…
… tidierdb designed for those coming from R, not from regular julia collections.
DBCollections defines no macros, basically all the interface is function-based and composable. While mimicking R in tidierdb requires very heavy macro use, and learning custom interpolation/escaping rules.

pdeffebach · September 10, 2024, 1:29pm

If all I know about an object is whether it implements the Tables.jl interface, will DBCollections.jl work out of the box?

aplavin · September 10, 2024, 1:31pm

What object do you mean?
DBCollections.jl defines the DBCollection type that works with standard Julia stuff like map/filter/… .
It doesn’t provide new data manipulation functions for existing Julia data or collection types.

TheCedarPrince · September 10, 2024, 1:33pm

I am just dropping by to say that this is a FANTASTIC package – I am a huge fan of FunSQL.jl and have built a number of tools on top of it. This looks truly promising!

One problem I have still been conceptually working on is how to best multi-thread or distribute some of these queries using Julia. Have you tried anything like that with your package yet?

aplavin · September 10, 2024, 1:55pm

What DBCollections does is enabling the use of regular Julia data manipulation functions with SQL tables. Execution of the resulting SQL queries is done by the underlying database engine, some are doing it in parallel – but that’s independent from the Julia interface.

If you are talking about parallelizing operations with Julia collections, this is useful but far out of scope here (:
I’ve been using OhMyThreads recently, they have nice parallel map… Is there something specific you are working on in this direction?

pdeffebach · September 10, 2024, 3:13pm

I provide the following example below. Would your package work with the MyTable object that I define below? I implement the Tables.jl interface for the object (as you can see by correctly calling DataFrame(t) at the end.

julia> struct MyTable
           nmes::Vector{String}
           data::Matrix{Float64}
       end;

julia> Tables.istable(::Type{MyTable}) = true;

julia> Tables.columnaccess(::Type{MyTable}) = true;

julia> Tables.columns(x::MyTable) = x;

julia> Tables.getcolumn(x::MyTable, i::Int) = x.data[:, i];

julia> function Tables.getcolumn(x::MyTable, nm::Symbol) 
           i = findfirst(==(nm), Symbol.(x.nmes))
           x.data[:, i]
       end;

julia> Tables.columnnames(x::MyTable) = Symbol.(x.nmes)

julia> using DataFrames

julia> t = MyTable(["a", "b"], [1.0 100.0 2.0 200.0])
MyTable(["a", "b"], [1.0 100.0 2.0 200.0])

julia> DataFrame(t)
1×2 DataFrame
 Row │ a        b       
     │ Float64  Float64 
─────┼──────────────────
   1 │     1.0    100.0

aplavin · September 10, 2024, 3:22pm

Hmmm, I guess I’m still missing something in your question…
DBCollections.jl doesn’t address processing data from Julian collections/tables/dataframes/vectors/… at all! Instead, it enables you to use existing data processing functions (like map/filter in Base) for sql tables.

Basically, it defines DBCollection type and methods for map/filter/… that take DBCollection. Nothing else!

So, if you are looking for processing data that’s already in julia structures – that’s probably not the package to look at.

pdeffebach · September 10, 2024, 3:36pm

Okay, so the answer to my question is that you do not define the Tables interface for the DBCollection object. Let’s say you have existing code which works on an in-memory Julia object, the way you work with that code must be restricted to map, filter, etc. If my existing code uses something like TableOperations.jl, which requires the Tables interface, I can’t easily port my existing work over to the DBCollection.

aplavin · September 10, 2024, 10:06pm

Indeed, the Tables.jl interface isn’t defined here.
It can be done, is there any specific scenario where that would be useful? Afaik, Tables.jl doesn’t provide much in terms of data manipulation – I personally use it for data I/O but not much else.

Yes, map and filter are the base Julia interface for data manipulation. Naturally, I focus on supporting them, as the most generic functions for in-memory datasets.

You are right, that won’t work out of the box. Not familiar with TableOperations.jl, but hard to imagine that their operations will work efficiently even if the Tables.jl interface will be defined for DBCollections…

Anyway, in practice it’s not too likely that some random code that processes an array would work as-is for an SQL database. The aim of DBCollections is to give a way to write code that works for both in-memory collections and databases.

It just happens that the generic data processing in Julia is already quite convenient – so what DBCollections does is just implement the SQL translation for these functions, without designing any new interface

mrufsvold · September 11, 2024, 11:18am

TableOperations.jl defines tons of split-apply-combine and select functionality that streamlines working with Tables, so if you were to implement the table interface, then a function that works on a DataFrame could also work on a DBCollection!

aplavin · September 11, 2024, 12:32pm

Added Tables support, was very straightforward: DBCollections.jl/src/DBCollections.jl at 00000000be784dfad350b90e8355f74655f19737 · JuliaAPlavin/DBCollections.jl · GitHub!
I tend to agree with you and @pdeffebach that it’s generally useful, at the very least for IO. Like, pass a DBCollection to CSV.write directly.

Still – maybe it’s just my unfamiliarity with dataframes/tableoperations, but I don’t see how this Tables interface for DBCollections can help with data processing. As I understand, applying TableOperations here would just load all the data into memory before any actual operations.

gdalle · September 11, 2024, 2:38pm

Sorry for dragging you into this Tables.jl mess. When I read

Imagine if you could use the exact same code to manipulate regular Julia collections and SQL tables…

I interpreted as:

You have existing code that works on database-like Julia collections, so it’s probably based on Tables.jl, and I allows you to extend it to SQL.

Instead the existing code was mostly higher-order functions like map or filter, as I see now.

pdeffebach · September 11, 2024, 2:42pm

I think @aplavin makes a good point though that there isn’t a whole lot of code that works on generic tables, meaning objects which only use the Tables.jl interface. People use DataFrames, TypedTables, etc. And even if there were code that operated generically on tables, it probably assumes the whole table fits in memory.

That said, with enough work it’s possible to implement the Tables interface for a data base that doesn’t naively load things into memory. But that’s more work that OP is not required to do.

aplavin · September 11, 2024, 8:10pm

Indeed, sorry for that minor confusion (:
Mentioning Tables ended up useful in the end: supporting that interface for DBCollections does make sense, for downstream IO.

Collection/iteration is the basis of the most generic fully-featured interface for data manipulation in Julia. Everything else is either heavily specific to a single type, or very limited in supported operations.
Lots of data structures, including many Tables tables, support this interface (to a varying degree) already. Now you can also use it with SQL databases

aplavin · September 11, 2024, 8:16pm

Exactly: I don’t think the Tables interface is enough to support many efficient and convenient operations on top of it.
Tables.jl is great for raw tabular IO and interop, I guess these considerations drove its design. For data manipulation?.. Definitely not the most convenient.

It’s possible to implement the Tables interface – and database libraries like SQLite.jl and others do that. But having the Tables interface doesn’t help making TableOperations efficient on them: it’s just the limitation of the interface.
Presumably, every operation has to be explicitly supported for efficiency. But then we don’t really gain anything on top of stuff like map and filter that already work – for many julia dataset types, and now for SQL databases as well.

xiaodai · September 17, 2024, 11:21pm

The first example already lost me, what’s @p what’s @o? A simple description would go a long way. Otherwise, I think the impression I get is that usability and understandability is not a priority atm. So it immediately signals to me as something I may not want to invest time in until those become priorities.