DuckDB, FunSQL.jl, reading data

Expanding Man Wednesday at 2:27 PM

using DuckDB to interact with files, granting full query ability for files, really is great. The only thing that sucks about it is having to deal with SQL. I’m increasingly tempted to build a bit of abstraction that let’s you do a bunch of common operations without having to do that

1

Miguel Raz Guzmán Macedo Wednesday at 2:33 PM

have you checked out FunSQL.jl?

Alexander Plavin Wednesday at 2:33 PM

Sounds exactly like SQLCollections.jl :slight_smile: “Use the exact same data manipulation syntax for Julia collections and SQL tables”

Alexander Plavin Wednesday at 2:33 PM

Example with file reading specifically:

1

Alexander Plavin Wednesday at 2:34 PM

(based on FunSQL.jl under the hood)

Alexander Plavin Wednesday at 2:35 PM

Most of SQLCollections tests are literally “given a Julia data manipulation function, check that the results are the same for an in-memory table and for an SQL table”.
See eg SQLCollections.jl/test/runtests.jl at master · JuliaAPlavin/SQLCollections.jl · GitHub.

Expanding Man Wednesday at 2:46 PM

hm, my problem is that usually if I care about doing this, I don’t want to read the whole thing into memory. As I recall QuackIO does that when you do read_csv

Alexander Plavin Wednesday at 2:48 PM

See the example screenshot above, it compares two approaches: the first reads everything into memory (basic in-memory table, no SQLCollections), the second performs all operations on the duckdb side with SQL, and only loads the final result into memory. (edited)

Expanding Man Wednesday at 2:49 PM

oh really? woah I completely missed that… I have to look back at quackio

1

Expanding Man Wednesday at 2:51 PM

I’m confused… lookng at QuackIO, I thought it’s just the select string that goes to the duckdb query? how is it getting the rest of the query string?

Alexander Plavin Wednesday at 2:52 PM

QuackIO does very little here – there’s a small integration layer at QuackIO.jl/ext/SQLCollectionsExt.jl at master · JuliaAPlavin/QuackIO.jl · GitHub to return a “lazy” SQLCollection instead of “eager” Julia table.
Most of the lifting is done by GitHub - JuliaAPlavin/SQLCollections.jl. (edited)

Expanding Man Wednesday at 2:52 PM

yeah, but doesn’t it wind up reading the whole thing into memory? like here, you are running DBI.execute on a bare select`, so doesn’t that read the whole thing into memory? QuackIO.jl/ext/SQLCollectionsExt.jl at master · JuliaAPlavin/QuackIO.jl · GitHub

Alexander Plavin Wednesday at 2:53 PM

There’s limit 0 :slight_smile: Basically, to get the column names, nothing else. (edited)

Expanding Man Wednesday at 2:54 PM

whaaaat, ok I clearly did not at all understand how this works. So what knows how to do the filtering and stuff is this DBI? Or I guess DBI can now iterate over the whole table? but does it do that without loading into memory? I clearly have to go and look at what’s actually happening inside the DDB wrapper

Alexander Plavin Wednesday at 2:55 PM

It’s SQLCollections.jl – see eg SQLCollections.jl/src/readfuncs.jl at master · JuliaAPlavin/SQLCollections.jl · GitHub and SQLCollections.jl/src/func_to_funsql.jl at master · JuliaAPlavin/SQLCollections.jl · GitHub.

Expanding Man Wednesday at 2:55 PM

so sqlcollections is somehow doing that in the course of pulling the data from the result of DBI.execute?

Alexander Plavin Wednesday at 2:57 PM

DBI.execute is used twice: first, with limit 0, just to get column names.
And then, on the final sql query with all the selects and wheres to get the end result.What SQLCollections.jl does, is it builds the final SQL query from Julia functions.

Expanding Man Wednesday at 2:59 PM

wooo… ok this stuff does way more than I thought, I kind of thought this was all syntax sugar that I didn’t care about. No wonder you seemed to think it was pointless for me to want to put ability to select cols directly in QuackIO

Expanding Man Wednesday at 3:05 PM

I feel like in QuackIO there should be a more explicit distinction between materializing to an object in memory and creating a table that interacts with some database object. Right now it seems very magical

Alexander Plavin Wednesday at 3:11 PM

Yes, maybe… The main idea of SQLCollections is to make it as seamless as possible to switch between Julia in-memory data manipulation and SQL-backed. QuackIO continues this trend.

Alexander Plavin Wednesday at 3:11 PM

Do you have a nicer user-facing interface in mind?

Daniel Rizk Wednesday at 3:17 PM

I’m going to put in a quick plug for TidierDB as well. db_table soon to have db alias accepts file paths (including compressed files). It grabs the metadata and stores it, but does not read into memory. From there TidierDB allows basically all available join types, inequality joins, joining TidierDB queries on each other, ability to work with dates, allows user defined functions, all preexisting sql and sql aggregate functions as well. I’ve written 90 tests to ensure accuracy as well.The next release will allow for unnesting columns of structs and arrays as well.I’ve been using it to work with about 50 gigs of compressed medical data over the last few weeks.

tidierorg.github.io

TidierDB.jl

TidierDB.jl

tidierorg.github.iotidierorg.github.io

Unnesting Columns - TidierDB.jl

TidierDB.jl

3

Expanding Man Wednesday at 3:57 PM

I personally really do not want all the table abstractions of Tidier, I just want to read stuff without reading the entire table into memory and mostly use simple functions from Base. I’ve mostly moved away from even having tables in my actual code.

3

Expanding Man Wednesday at 3:59 PM

I don’t have any firm ideas about an interface, but I think something along the lines of a QuackTable object that you can refer to without loading into memory, and then functions to materialize it into objects. I think there should at least be a harder distinction between referring to some table reference and actual table data in memory, I don’t necessarily know what that should look like.

Alexander Plavin Wednesday at 6:58 PM

Thanks for such a PoV!
For a bit more context, SQLCollections are useful by themselves as well – as a Julian interface to SQL tables; QuackIO just adds support for csv/json/… file reading.The overarching idea is that an SQLCollection should be a drop-in replacement for a Julia collection/table, whenever possible.
Under this view, the QuackIO interface is very natural: e.g., read_csv(StructArray) reads a csv as a StructArray, and read_csv(SQLCollection) reads as an SQLCollection. The former is eager, the latter is lazy, but they aim to support the same base Julia interface.Do you think this confusion (lazy/eager) could be solved with better docs/docstrings, or some interface adjustments could be helpful?

Expanding Man Wednesday at 7:29 PM

this may not be a common confusion, part of my problem is that I actually looked at the source code of some of this stuff (apparently not carefully) and just assumed when I saw the select queries getting run that I was seeing stuff get copied from disk. That said, to me at least there is a huge distinction between reading the whole thing from memory or using ddb to read it lazily somehow. Maybe this would be obvious to someone more familiar with SQLCollections, but to me it was a big surprise. Related, it wasn’t super obvious to me when looking at SQLCollections what it was actually doing.I think, if I were to do something similar, I would try very hard to get SQLCollection to be a StructArray-like object itself, that loads stuff as lazily as possible via ddb queries. The disadvantage of this is that this is not really how a SQL table wants to behave, so you’d probably need a bunch of dedicated functions, but I’ve had a lot of success avoiding stuff like DataPipes et cetera in favor of simple, much more generic stuff centered around Base, so pulling in the dependencies of SQLCollection is not something I personally am not that enthusiastic about.

jarjar Wednesday at 7:48 PM

I’ve mostly moved away from even having tables in my actual code.

I’m curious what you mean by this? In julia my “tables” are normally just StructArrays. Are you avoiding these? What are you doing instead?

jarjar Wednesday at 7:49 PM

(Btw DataPipes is extremely generic, it’s just a piping macro, nothing else. Works great with Base because it pipes by default into the last position so it works with map/filter/…)

Alexander Plavin Wednesday at 8:11 PM

I think, if I were to do something similar, I would try very hard to get SQLCollection to be a StructArray-like object itself, that loads stuff as lazily as possible via ddb queries.

Hmm, that’s basically what it already is…
(not duckdb-specific, I use it with sqlite and mysql as well) (edited)

Alexander Plavin Wednesday at 8:14 PM

I’ve had a lot of success avoiding stuff like DataPipes et cetera in favor of simple, much more generic stuff centered around Base

I also do data manipulation in Julia mostly with Base functions, stuff like map and filter. They work nicely with a lot of containers, no reason to switch to anything else really :slight_smile:
DataPipes.jl is a pure syntax transformation, for convenient piping. It’s design is driven by Base functions interface. (edited)

Expanding Man Wednesday at 10:04 PM

Hmm, that’s basically what it already is…

It didn’t seem to me that you could, for example, easily extract a single column, and it didn’t seem to allow striaghtforward map over named tuples. You could do this, for example, by inferring the schema and using it as a type parameter, though I acknowledge it may not be simple to get it to work nicely.

jarjar Wednesday at 10:08 PM

julia> @p [(;a=1, b=10), (a=2, b=20)] map(@insert _.c = _.a + _.b) 2-element Vector{@NamedTuple{a::Int64, b::Int64, c::Int64}}: (a = 1, b = 10, c = 11) (a = 2, b = 20, c = 22)

Expanding Man Wednesday at 10:09 PM

I’m not sure what this is showing, can you do this with SQLCollections?

jarjar Wednesday at 10:11 PM

That’s showing what datapipes does. It’s just piping + Accessors. Presumably exactly the same code would work on SQLCollections

Expanding Man Wednesday at 10:14 PM

oh, I think we are talking about different things. That example might work with SQLCollections but I was not able to do e.g. map(r -> r.colname, tbl), it doesn’t seem to behave like an iterator of rows

I don’t know, I don’t have it in front of me right now, first several things I tried when I was playing with it earlier didn’t work

jarjar Wednesday at 10:20 PM

Yknow I’m wrong

conn = DBInterface.connect(FunSQL.DB{SQLite.DB}, “tmp.db”) tab = SQLCollection(conn, :tab) julia> @p begin tab map(@insert _.c = _.a + _.b) end ERROR: SQLCollections: mapping #7 not supported

Expanding ManExpanding Man Wednesday at 10:22 PM

yeah, so that’s basically what I was saying

Alexander PlavinAlexander Plavin Wednesday at 10:26 PM

easily extract a single column

tbl |> map(@o (;_.colname)) should work

Alexander Plavin Wednesday at 10:27 PM

allow striaghtforward map over named tuples

What exactly do you mean? There are a lot of tests that do map over an SQLCollection the same way as one could do for a Vector of NamedTuples. E.g., tbl |> map(@o (a=_.j * 10, b=_.i + _.j + 1)). (edited)

Alexander Plavin Wednesday at 10:28 PM

See tests SQLCollections.jl/test/runtests.jl at master · JuliaAPlavin/SQLCollections.jl · GitHub for lots of examples on what is supported :slight_smile:

jarjar Wednesday at 10:28 PM

What’s wrong with my example?

Alexander Plavin Wednesday at 10:29 PM

It passes an opaque anonymous function to map, surely one cannot convert an arbitrary Julia function to SQL.

Alexander Plavin Wednesday at 10:29 PM

Compare

opaque function, cannot be converted to SQL: julia> x → (;x.colname) #3 (generic function with 1 method) # inspectable function, can be converted: julia> @o _.colname (@o _.colname)

(edited)

jarjar Wednesday at 10:30 PM

Which part is the opaque bit? @insert or + or _.a ? (edited)

Alexander Plavin Wednesday at 10:31 PM

Doesn’t matter what’s inside, the anonymous function is always opaque by itself. (edited)

Expanding Man Wednesday at 10:31 PM

I mean in general you would have to read the whole Julia AST and turn it into SQL somehow, so yeah I can see why it would be super hard to just make it generic

Alexander Plavin Wednesday at 10:32 PM

In general, yes, but it works surprisingly well and covers a lot of usecases already :slight_smile:

Expanding Man Wednesday at 10:32 PM

or you can just try to infer when it’s ok to load, like in jar’s example you could have just loaded it at minimal extra cost (edited)

Alexander Plavin Wednesday at 10:32 PM

you could have just loaded it

Loaded what?..

jarjar Wednesday at 10:32 PM

I’m not really seeing a fundamental difference between @o _.a and @insert _.c = _.a?

Alexander Plavin Wednesday at 10:33 PM

Let’s ignore @p for now – just display and check exactly what you pass to map. (edited)

Expanding Man Wednesday at 10:35 PM

Loaded what?..

I just mean that in jar’s specific example, if you read out columns a and b into memory, it really doesn’t cost you anything if you intended to materialzie the whoel table a, b, c. But this is rather specific to that example, so

Alexander Plavin Wednesday at 10:37 PM

If you want to load into memory, then just call collect before this map, and it becomes a regular in-memory Julia table :slight_smile:

Expanding Man Wednesday at 10:39 PM

yeah, I get that, I’m just trying to think of how else this could work, I don’t have any good answers for that. Actually, I think since you can’t just magically compile Julia code into sql, I could wind up going the opposite direction and just using FunSQL until I can get something I’m willing to load into memory. Don’t know, I’ll have to play around a lot more to see what I think

in any case, SQLCollections is really cool, thanks for pointing it out

Alexander Plavin Wednesday at 10:42 PM

FunSQL is great, it’s the SQLCollections backend :slight_smile:
I just want to write a single code that can operate on SQL or in-memory tables – not just “convenient syntax”, but literally the same code. This naturally leads to certain limitations, but I find them reasonable and not too constraining in practice. (edited)

jarjar Wednesday at 10:44 PM

Ok let’s see.

% sqlite3 tmp.db <<<‘create table tab(a int, b int); insert into tab(a,b) values (1,10),(2,20);’ julia> conn = DBInterface.connect(FunSQL.DB{SQLite.DB}, “tmp.db”) julia> tab = SQLCollection(conn, :tab) julia> collect(tab) 2-element StructArray(::Vector{Int64}, ::Vector{Int64}) with eltype @NamedTuple{a::Int64, b::Int64}: (a = 1, b = 10) (a = 2, b = 20) julia> map(x → x.a, tab) ERROR: SQLCollections: mapping #5 not supported julia> map((@o .a), tab) |> collect ERROR: SQLCollections: mapping (@o .a) not supported julia> map(row → (;row.a), tab) ERROR: SQLCollections: mapping #7 not supported julia> map((@o (;.a)), tab) |> collect 2-element StructArray(::Vector{Int64}) with eltype @NamedTuple{a::Int64}: (a = 1,) (a = 2,) julia> map((@o (;.a, .b, c=.a + _.b)), tab) |> collect 2-element StructArray(::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}) with eltype @NamedTuple{a::Int64, b::Int64, c::Int64}: (a = 1, b = 10, c = 11) (a = 2, b = 20, c = 22)

(edited)

Alexander PlavinAlexander Plavin Wednesday at 10:45 PM

map(@o _.a, tab) is not parsed as you’d expect :slight_smile:

1

Alexander PlavinAlexander Plavin Wednesday at 10:46 PM

Also, I’m not sure what exactly map((@o _.a), tbl) should result in: definitely not an SQLCollection?.. Because SQL tables fundamentally consist of columns, it cannot just correspond to a vector of numbers… (edited)

Alexander PlavinAlexander Plavin Wednesday at 10:47 PM

But I also thought about this, because it would be convenient in usage.

jarjar Wednesday at 10:48 PM

sqlite> select 1+1; 1+1 -------------------- 2

That’s a column named 1+1 with one row (edited)

Alexander PlavinAlexander Plavin Wednesday at 10:49 PM

Ok, so?.. It’s still a table with columns, not just a collection of numbers. (edited)

jarjar Wednesday at 10:50 PM

Sql takes an expression to a (column name , column). In this case you have an expression and a column and you’re asking how to get a column name, and i think the answer is string(ex) ?

Alexander PlavinAlexander Plavin Wednesday at 10:51 PM

But would that be useful? I’d expect map((@o _.a), tbl) |> collect to result in a Vector of numbers (if a was a numeric column).

Alexander PlavinAlexander Plavin Wednesday at 10:51 PM

And it’s possible in principle.
I had an idea to add something like row_func field to SQLCollection, a function to be applied to each row when collected. Then, map((@o _.a), tbl)'s result would have row_func = only… (edited)

jarjar Wednesday at 10:52 PM

I’d have to think more about it

Alexander PlavinAlexander Plavin Wednesday at 10:54 PM

For now, I only added stuff that clearly and unambiguously maps to SQL. That’s already quite a lot :slight_smile:
These less-clear-cut cases need extra thoughts and considerations – they are welcome, but care has to be taken to remain consistent. (edited)

1

jarjar Wednesday at 10:56 PM

Is there a way to write my @insert so that it works?

jarjar Wednesday at 10:56 PM

without listing all the existing fields as in

(@o (;_.a, .b, c=.a + _.b))

Alexander PlavinAlexander Plavin Wednesday at 10:58 PM

With @insert specifically? Not now, I think.
Note there’s mapinsert in DataManipulation.jl. It was originally designed for performance for column-based tables (eg, mapinsert is faster than map(@insert ...) for a StructArray), but nicely suits SQLCollections as well: mapinsert(c=@o _.a + _.b) should work (edited)

jarjar Wednesday at 11:04 PM

That’s nice. Thanks.

jarjar Wednesday at 11:05 PM

julia> mapinsert(c=(@o _.a + _.b), tab) |> collect 2-element StructArray(::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}) with eltype @NamedTuple{a::Int64, b::Int64, c::Int64}: (a = 1, b = 10, c = 11) (a = 2, b = 20, c = 22)