[ANN] SQLCollections.jl – use Julia data manipulation functions for databases

aplavin · September 18, 2024, 12:34am

Totally, that’s fair! For now, DBCollections is an experiment: can we just use existing Julia data manipulation functions for working with SQL databases?
Turns out the answer is “yes”, as DBCollections demonstrate. That’s why this thread!
But it’s definitely not a fleshed out implementation, needless to say about sparse documentation.

I’ve been using DBCollections for DuckDB and MySQL databases during the past weeks, and find it convenient and useful. Doesn’t mean it will be useful for you specifically!

Sorry, I jumped straight into what DBCollections does in the first post, and didn’t explain the functions/macros from other packages. The reason is they aren’t specific to databases and defined elsewhere: @p in DataPipes.jl, @o in Accessors.jl.

One can use DBCollections without this syntactic sugar, it’s just less convenient. For example, filter(Base.Fix2(>, 0) ∘ PropertyLens(:a), data) works whenever data is a regular Julia collection of elements: it selects those x with x.a > 0. With DBCollections, the same code works for data being an SQL table!
@o simply lets you write the inner function in a more readable way: filter((@o _.a > 0), data).
And @p gets you composable piping, @p data |> filter(@o _.a > 0) |> ....

Again, both aren’t directly related to DBCollections, just convenient to be used together.

liuyxpp · September 18, 2024, 1:48am

I am also very confused by the @p and @o. I thought they come from DBCollections.jl. But it seems I am wrong. It is a bad idea to show code that depends on other deps without any explanations, especially for those macros since they kind of doing some magic things.

aplavin · September 18, 2024, 11:29am

The idea was to demonstrate:

Here is existing code that already works for regular Julia collections (unrelated to DBCollections)
We want to use the exact same code for SQL databases
Solution: just pass a DBCollection instead of a Julia collection!

Ie, I don’t design any new API in this thread, only add SQL support to existing data manipulation functions.

Judging by the comments, this didn’t really come clearly at first. Sorry! Hope now it’s more understandable with these explanations (:
Feel free to suggest better wording for documentation.

aplavin · September 22, 2024, 5:37pm

Well, I’ve been enjoying using DBCollections recently, it feels convenient and natural indeed!
Also, still remain impressed by Julia composability and how simple the implementation is.

Planning to register it in General soon, probably under the name of SQLCollections.jl. Then will make some more posts here with examples and neat synergies (:

drizk1 · September 23, 2024, 2:27pm

Thought I’d just share my thoughts on the differences briefly as the author of TidierDB.jl. I remember seeing FunSQL for the first time thinking wow that’s p slick, and SQLcollections seems nice too.

Defining its own macros means that TidierDB is actually quite flexible.

TidierDB recreates TidierData.jl giving users one syntax for DataFrames and SQL databases - users can write pipelines that work on a DB or DF.
The different parsing approach means TidierDB already supports just about any SQL function that’s built into a database as well as UDFs, in addition to DataFrame manipulation supported by the Tidier.jl meta package.
- This means if a user has DuckDB as their backend, all of these aggregate functions are available through Tidier syntax (and the same applies for the other 10 supported DB backends) in addition to all other preexisiting functions.
- pseudocolumns, such as rowid, support is readily available
- window functions such as row_number() are supported
DB specific styles can effortlessly blended with Tidier syntax (ie . chaining in DuckDB

A few examples:

using any built in aggregate function

@chain t(mtcars) begin
      @group_by cyl
      @summarize(kurt = kurtosis(mpg))
      @collect
end

using preexisting db functions ( ie lag and lead functions from DuckDB)

@chain t(ta) begin
    @group_by serial_number trip
    @window_order timestamp
    @mutate begin
        lag_a = lag(a, 1)
        lag_b = lag(b,1)
        lead_a = lead(a, 1)
        lead_b = lead(b, 1)
    end
    @arrange a
    @aside @show_query _
    @collect 
end

use existing DuckDB json functions

# use existing DuckDB json functions
@chain db_table(db, :example2) begin
        @mutate begin 
             fam = json_extract_string(jcol, "\$.family")
             spec = json_extract(jcol, "\$.species")
             has_family = json_exists(jcol, "\$.family")
             habitat_type = json_value(jcol, "\$.habitat.type")
             habitat_location = json_extract_string(jcol, "\$.habitat.location")
         end
         @select(id, fam, spec, has_family, habitat_type, habitat_location)
         @mutate(test = spec[0])
         #@aside @show_query _
         @collect
end
4×7 DataFrame
 Row │ id     fam       spec                     has_family  habitat_type  habitat_location  test    
     │ Int32  String?   String                   Bool        String        String            String  
─────┼───────────────────────────────────────────────────────────────────────────────────────────────
   1 │     1  anatidae  ["duck","goose","swan"]        true  "wetland"     Europe            "duck"
   2 │     2  felidae   ["lion","tiger"]               true  "savannah"    Africa            "lion"
   3 │     3  canidae   ["wolf","fox"]                 true  "forest"      North America     "wolf"
   4 │     4  missing   ["eagle"]                     false  "mountain"    Asia              "eagle"

blending syntax

@chain db_table(db, mtcars_path)begin
       @mutate(model2 = model.upper().string_split(" ").list_aggr("string_agg",".").concat("."))
       @select model model2
       @head 2 
       @collect
end
2×2 DataFrame
 Row │ model          model2         
     │ String         String         
─────┼───────────────────────────────
   1 │ Mazda RX4      MAZDA.RX4.
   2 │ Mazda RX4 Wag  MAZDA.RX4.WAG.

use TidierData's `across` with window functions to calculate moving averagess

@chain db_table(db, plants_csv_path) begin 
       @group_by Plant
       @window_frame -3 3
       @window_order Date
       @mutate(across(MWh, (minimum, maximum, mean)))
       @head 5
       @aside @show_query _
       @collect
end
WITH cte_2 AS (
SELECT  MWh, Date, Plant, MIN(MWh) OVER (PARTITION BY Plant 
        ORDER BY Date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS minimum_MWh, MAX(MWh) OVER (PARTITION BY Plant 
        ORDER BY Date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS maximum_MWh, AVG(MWh) OVER (PARTITION BY Plant 
        ORDER BY Date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS mean_MWh
        FROM '/Users/danielrizk/Downloads/power-plant-generation-history.csv')  
SELECT *
        FROM cte_2  LIMIT 5
5×6 DataFrame
 Row │ MWh     Date        Plant   minimum_MWh  maximum_MWh  mean_MWh  
     │ Int64   Date        String  Int64        Int64        Float64   
─────┼─────────────────────────────────────────────────────────────────
   1 │ 564337  2019-01-02  Boston       469538       564337  5.17451e5
   2 │ 507405  2019-01-03  Boston       469538       564337  5.08793e5
   3 │ 528523  2019-01-04  Boston       469538       564337  5.0853e5
   4 │ 469538  2019-01-05  Boston       469538       613040  5.2346e5
   5 │ 474163  2019-01-06  Boston       469538       613040  5.26067e5

robust joining capabilities and cross database joins
over 60 tests demonstrating accuracy

@eval allows straight forward interpolation of values into functions with chains so that users can pass it various values.

Albeit, I am a bit less familiar/used to working with Julia collections, but I see TidierDB as a really nice SQL companion package for folks who want to work DataFrames and SQL backends while using a syntax that is straightforward for a lay person to follow and understand.

Ultimately, I think it comes down to preference. I think some folks prefer collections and others prefer datafames, so having options is good.

aplavin · September 24, 2024, 5:10pm

Thanks for the detailed writeup, now I also read about tidierdb a bit more carefully It does indeed look very convenient and suitable for those coming from the R approach of non-standard evaluation. Happy for that crowd – now they also have a familiar data manipulation interface in Julia!

Meanwhile, SQLCollections.jl takes the philosophy of regular Julia functions. Most fundamentally, it involves composability, in several senses.
Note that SQLCollections doesn’t even have to do anything special for these – it’s automatic whenever one writes Julia functions.

One sense is just the referential transparency: I can take an expression passed to a function, assign it to a variable, and pass that variable to the function – the result would be exactly the same. This isn’t how R or Tidier** macros work – there, one cannot simply assign the function/macro argument to a variable and pass it around.
For example, in regular Julia data manipulation, these are equivalent – and with SQLCollections, of course:

... filter(@o _.a > 3) ...
# is the same as
pred = @o _.a > 3
... filter(pred) ...
# is the same as
val = 3
... filter(@o _.a > val) ...
# is the same as
val = 3
func = >
... filter(@o func(_.a, val)) ...

Another composability manifestation is being able to write separate functions for parts of the pipeline. Again, this is ubiquitous in Julia, including SQLCollections:

do_some_selection(data) = filter(..., data)

... |> do_some_selection |> ...

In R, or with Julia macros, it’s also possible – but one needs to be careful to achieve that.

And finally, composability with a wide range of types – the exact same code and functions, from collections to SQL databases. Some, like filter, even work for dataframes, unfortunately dataframes don’t support other common data manipulation functions.

As for SQLCollections vs TidierDB comparison specifically, now that I understand it a bit more, I’d highlight these points:

SQLCollections uses the most general syntax applicable to many data types in Julia. While the only syntax similar to TidierDB (afaik) is TidierData, and it only supports one data type – dataframes.
SQLCollections lets you write the exact same syntax for Julia collections and for databases. TidierDB uses syntax similar to TidierData, but (afaik) not exactly the same – the former requires DB. prefix, so writing a single function to process both is impossible. (although you @drizk1 say that the syntax is the same, I’m confused)
SQLCollections doesn’t define any macros and the interface is function-based. This gives native familiar Julia semantics and composability, something nontrivial/impossible to get in a macro-based interface.

It’s nice to see different approaches to design when solving similar problems, highlights differences in background I guess

aplavin · September 26, 2024, 2:18pm

SQLCollections.jl is finally released in General

A nice synergy – that basically comes for free – is querying tabular files like CSV or parquet. See the QuackIO.jl package: it could already use DuckDB to performantly read such files into Julia, like read_csv(StructArray, "my_file.csv").
Now, with SQLCollections.jl, we can query these files without fully loading into memory, while still using the same Julia syntax:

using QuackIO, SQLCollections

data = read_csv(SQLCollection, "my_file.csv")
# data is an SQLCollection and can easily be queried :)

aplavin · October 4, 2024, 2:52pm

Currently, only single-table operations are directly supported in SQLCollections.jl. However, I noticed that you can just pass any FunSQL expression as the input – not just table names!
It enables operating on joins:

using SQLCollections
using FunSQL: From, Join, Get

X = SQLCollection(conn, From(:tbl1) |> Join(:tbl2 => From(:tbl2), Get.id .== Get.tbl2.id))

After that, you can work with X same as with any other SQLCollection. Julia and FunSQL composability make it work automatically.

But this isn’t the same syntax as one uses with plain Julia datasets, of course.
So I still plan to add FlexiJoins.jl joins support to SQLCollections

Jake · October 28, 2024, 11:43am

I am not a user of AccessorsExtra, or DataManipulation. Your writeup makes me want to try SQLCollections though.

With a Postgres Database, I got as far as data = SQLCollection(mydb, :tablename). I am wondering what to do next using base Julia to view or manipulate the result. A bit of a tutorial without using AccessorsExtra or DataManipulation would be appreciated.

Alternatively maybe these packages are essential for a good user experience with SQLCollections? In which case I will need to answer the question why AccessorsExtra and why DataManipulation.

aplavin · October 28, 2024, 1:08pm

The examples in the first post now start with a simple no-macro example:

data = SQLCollection(mydb, :tablename)
filter(Base.Fix2(>, 0) ∘ PropertyLens(:a), data)

Here, SQLCollection is from SQLCollections.jl, and PropertyLens is from Accessors.jl. This is to demonstrate how SQLCollections actually work, what kind of functions they take as inputs.
In practice, one uses the @o macro from Accessors.jl to create this function, instead of manually writing stuff like Base.Fix2 – that’s currently the next example in the first post:

filter((@o _.a > 0), data)

For a lot of data manipulation with SQLCollections, that’s all you need: Base functions like map/first/filter/… and the @o macro to create the inner function.

DataManipulation.jl is only needed for more involved ways to manipulate data that don’t have corresponding functions in Base – most notably, grouping. The best way to start using SQLCollections.jl is, of course, with all-familiar Base functions Made it cleaner by tweaking examples!

The design of SQLCollections.jl is that one takes a data manipulation pipeline that works on regular Julian collections (functions from Base, or more involved ones from DataManipulation.jl) and replaces all anonymous functions with Accessors.jl. The result would be a pipeline that works for both Julian collections, and for SQL tables.

Let me know if something is still not really clear!

Jake · October 28, 2024, 5:39pm

I got this line to work without an error. It gives me back another SQLCollection. How can I see what is in this collection?

aplavin · October 28, 2024, 6:01pm

Great!

Use collect(), it retrieves the data from an SQLCollection to a Julian vector.

Topic		Replies	Views
[ANN] SQLStore.jl: use SQL tables as persistent Julia collections Package Announcements	0	596	November 19, 2021
DuckDB, FunSQL.jl, reading data Data sql , duckdb	0	105	March 18, 2025
Tables.jl: a table interface for everyone Data tables	19	10688	November 19, 2018
DataFrames.jl - Choosing between the core functions and available libraries (Query.jl, DataFramesMeta.jl, etc) Data	10	2067	September 15, 2018
[ANN] TypedMatrices.jl – An Extensible Julia Matrix Collection Utilizing Type System Package Announcements package , announcement , linearalgebra , matrices	0	261	September 21, 2024

[ANN] SQLCollections.jl – use Julia data manipulation functions for databases

Related topics