Status of Julia database libraries

I’m someone who uses (and likes) databases intensively at work across multiple languages (building data pipelines and analytics), and I’ve been toying with the idea of putting some effort into a Julia SQL builder, as I feel multiple dispatch is a near-optimal paradigm for creating a very integrated and extensible mapping to SQL (similar to Elixir’s Ecto). Julia could definitely be a strong player in this area with easier-to-use tools.

But what I haven’t figured out is the best way to integrate the specific database backends. Octo.jl and SearchLight.jl add all supported databases as dependencies and handle the translation directly. What I was thinking was to define a common interface, as you said, plus multiple repos with wrappers for each supported database (instead of direct PRs). The wrapper could also support a simpler interface for raw queries, like Python’s Records, and allow others to implement the interface and get the same functionality. And I could build the SQL builder layer on top of it.

Of course, it can’t be something that ends up so complex that I couldn’t work it out by myself (with my limited free time), which is my main fear. And I mostly use Postgres, but this would require testing against multiple databases (unless I focus on creating the interface but only implement it for LibPQ.jl).

1 Like

FWIW, it is my impression that in economics, Stata is the default solution, and knowledge of SQL is rare.

But debating this is pretty irrelevant — if someone finds databases super-useful, and has the time/resources to work on Julia packages, we will have this, otherwise most likely not. I tend to agree with the spirit of

even though I would phrase it differently — I am sure that database support will mature as the language community grows. I just don’t think that discussions like this can make that happen any faster.

4 Likes

If you go this route, I’d recommend looking at the R database API (DBI). Lots of data analysis people will be familiar with it, it seems reasonably well thought out, and it already relies on a dispatch-like system (not multiple dispatch, but at least it won’t get in your way).

I’ll look into it, though I was thinking more along the lines of the adapter/wrapper being a simple way to generate a connection string, establish a connection, and expose a generic execute method, with a unified way of feeding prepared-statement parameters into the raw SQL (maybe through a non-standard string literal) and a way of selecting the output format. Basically, the minimum needed to use the database (with maybe a way to retrieve the underlying DB interface for more specific tools).

Above that, the SQL builder layer would work similarly to SQLAlchemy Core (except simpler), creating a direct mapping via multiple dispatch. For example, max(a::SQLFragment, b::SQLFragment) would generate another SQLFragment, letting you compose queries by writing functions exactly as you would for numbers, while the underlying struct keeps track of every parameter. Plus utilities to manage transactions and, if possible, connection pools and validation (similar to Clojure’s Validateur in my head).
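To make that concrete, here’s a rough sketch of what I have in mind; SQLFragment, fragment, and combine are all hypothetical names, not an existing package API:

    # Hypothetical sketch; placeholders use Postgres-style $1, $2, ...
    struct SQLFragment
        sql::String          # SQL text containing numbered placeholders
        params::Vector{Any}  # parameter values collected so far
    end

    # Lift a plain value into a single-placeholder fragment.
    fragment(x) = SQLFragment("\$1", Any[x])
    fragment(f::SQLFragment) = f

    # Combine two fragments under a SQL function, renumbering the
    # right-hand placeholders so they stay aligned with their values.
    function combine(op::String, a::SQLFragment, b::SQLFragment)
        offset = length(a.params)
        shifted = replace(b.sql,
            r"\$(\d+)" => m -> "\$" * string(parse(Int, m[2:end]) + offset))
        SQLFragment("$op($(a.sql), $(shifted))", vcat(a.params, b.params))
    end

    # Multiple dispatch makes fragments usable like numbers.
    Base.max(a::SQLFragment, b::SQLFragment) = combine("GREATEST", a, b)
    Base.max(a::SQLFragment, b) = max(a, fragment(b))
    Base.max(a, b::SQLFragment) = max(fragment(a), b)

With these definitions, max(fragment(2), fragment(3)) would produce the fragment GREATEST($1, $2) carrying the params [2, 3].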

I don’t really intend to go higher-level than that; the focus would be on being more feature-complete with SQL, or at least easily extensible for that purpose, and well documented (one of Ecto’s greatest features). Higher-level layers, such as integration with Query.jl or anything ORM-like, can always be built on top of it though.

2 Likes

That’s basically all the R API is: a way to get connected, feed queries, and get big result sets. I guess it also has a way to read a whole table or save a data table back to the DB, and to ask the DB about things like whether a table exists or how many rows your query returned.

It’s well thought out for the data analysis use case and pretty minimalist. In general, for data analysis you wouldn’t want all the max(a, b) stuff; that’s much more useful when you treat the database as a kind of “permanent version of RAM”. In data analysis you want “hand me a big table of data so I can analyze it”. The usage is more batch-based.

Since the higher-level abstractions can be built on the batch abstraction, I think the batch abstraction would make good sense as a base data-connection layer. And I think reinventing the wheel here is guaranteed to slow down the process: R’s database API works well for millions of people doing data analysis, and a Julia version of it would build on all that knowledge.

1 Like

It’s not exactly reinventing the wheel, since it’s how the tools I already use work, and they cover my specific use case (dynamic generation of hand-optimized queries), which unfortunately might not be the most general use case for the language. And honestly, it seems like a fun project.

But I’m aware it’s not what everyone might want, so I wanted to build it on top of a more basic layer (which is basically what LibPQ and others already provide, just without a common interface, which is probably why there hasn’t been much motivation for doing it beyond reaching consensus). I’ll use R’s API for reference, but I don’t feel it’s the best fit as-is. For a start, stuff like dbFetch and dbClearResult would overshadow the already implemented Tables.jl interface (which is already more feature-complete), dbWriteTable has the same annoyance I have with pandas’ to_sql (no support for custom upsert rules, for example), and the way of binding arguments seems kind of cumbersome.

And in general it’s so low-level (outside of the few dataframe methods) that it would basically be a remapping of LibPQ’s (and others’) functions to different names, at which point it might be better to just submit PRs there. Also, dataframe support might be better done by supporting all iterables in one batch-insert function (lists of tuples, named tuples, structs, and dataframes). But I have to really think about it.

I just got hired to work on a big database with the state of California, and part of the reason I got that job was a project I did with Julia using that data. Now I’ll get to work from the inside and hope I can implement some database stuff in Julia. I haven’t started yet, so I’m still unclear about my future responsibilities and how much freedom I’ll have to implement new things, but I hope I can add to the database tooling soon. There’s a lot for me to learn first, because I’ve never worked with databases before. But thanks for reviving this discussion; it has been very helpful for me as I prepare for the next step.

21 Likes

Yeah, I didn’t mean to knock the idea of higher-level interfaces; I just don’t think they are necessarily the right level for all uses.

I guess my point was that a common middleware interface would be the first stepping stone to higher-level interfaces such as yours that could work with different backends, and that anyone looking to build such middleware could do a lot worse than emulating the API the data analysis community has thought about a lot in R. If a DBI-like interface were available in Julia, it would smooth the move from R to Julia for people looking to analyze data housed in big databases.

Also, I didn’t mean that the R DBI should just be parroted function for function, but rather that its basic organization could be considered a good foundation and then adapted to the specific advantages Julia has, such as the DataFrame/Tables interfaces, etc.

We are actually in agreement then. I think it should definitely aim for feature parity (so the documentation can include a translation section for people coming from R) while playing to the strengths of Julia (in particular its polymorphism).

2 Likes

Right. For example, in R you call a function to get a DB driver object, like SQLite(). I don’t think that adds much of anything. Basically that should just be something like

using DBISQLite

and then you call dbConnect to get a connection object. I’d suggest that simply be the constructor for a SQLiteConnection,

something like:

c = SQLiteConnection(datafile::String)

Instead of dbGetQuery(conn, query), I’d suggest

q = DBQuery(conn, query)

which creates a query object that uses the connection, doing whatever it does based on the underlying capabilities (i.e. TCP/network vs. file vs. directory of files, or whatever),

and then something like

table = db_fetch(q, n)

which returns a DataFrame with n rows from the query q; if you leave out n, it fetches all the rows.

Stuff like that, unifying the interface to MariaDB/MySQL, Postgres, SQLite, and a couple of other DBs, would provide a nice abstraction layer that feels Julian thanks to subtyping and multiple dispatch, while still being familiar to anyone who has used something like R’s DBI.
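Pulling that together, a minimal sketch of the shape I mean (all names made up for illustration, not an existing API):

    # Hypothetical common layer; each backend package would provide a
    # concrete connection type and its own db_fetch method.
    abstract type DBConnection end

    struct SQLiteConnection <: DBConnection
        datafile::String
    end

    struct PostgresConnection <: DBConnection
        host::String
        dbname::String
    end

    # A query object carries its connection, so backends can
    # specialize behavior on the connection type.
    struct DBQuery{C<:DBConnection}
        conn::C
        sql::String
    end

    # Generic fallback; each backend overrides this for its driver.
    db_fetch(q::DBQuery, n::Integer = typemax(Int)) =
        error("db_fetch not implemented for $(typeof(q.conn))")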

1 Like

Yes, even when thinking about the SQL builder layer, I was considering using dispatch to support non-SQL queries, like graph databases (such as ArangoDB), with the same interface. Which is part of why I considered it a waste that this doesn’t already exist in Julia.

Basically, you only ever use an Adapter. I assume the very minimal interface can be just: open a connection, close a connection, and execute a query (returning data). Assuming it’s a SQL database, there is a SQLAdapter <: Adapter, and every other method is implemented using that minimal interface (batch insert, table creation, metadata; it can even fake prepared statements if the type of DB doesn’t support them). Then each SQLAdapter (such as PostgresAdapter) can specialize any method where that’s advantageous. And as long as a non-SQL store implements the methods you need, it will work as well.
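As a rough sketch of that hierarchy (every name here is hypothetical):

    # Hypothetical adapter hierarchy; nothing here is an existing API.
    abstract type Adapter end
    abstract type SQLAdapter <: Adapter end

    struct PostgresAdapter <: SQLAdapter
        conn  # e.g. the underlying LibPQ.Connection
    end

    # The minimal interface every concrete adapter must implement.
    execute(a::Adapter, sql, params = ()) =
        error("execute not implemented for $(typeof(a))")

    # Generic functionality built only on the minimal interface,
    # so any SQLAdapter gets it for free (rows as NamedTuples).
    function batch_insert(a::SQLAdapter, table::String, rows)
        for row in rows
            cols  = join(string.(keys(row)), ", ")
            slots = join(("\$$i" for i in 1:length(row)), ", ")
            execute(a, "INSERT INTO $table ($cols) VALUES ($slots)", values(row))
        end
    end

    # A concrete adapter can then override any method when its backend
    # offers a faster path, e.g. Postgres COPY for batch_insert.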

But what I’d like to try is something like fetch(q, MyStruct/NamedTuple/DataFrame), which would allow returning any type instead of just a dataframe (and possibly enable more complex type conversions). Having the number of rows as an argument seems kind of uncommon in my experience; it’s usually either get() (for queries that return a single match) or all() (and you control how many with LIMIT/OFFSET).
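For example, something along these lines (fetch_as and rowiterator are made-up names; rowiterator is assumed to yield NamedTuples):

    using DataFrames

    # Dispatch on the requested output type.
    fetch_as(q, ::Type{DataFrame})  = DataFrame(rowiterator(q))
    fetch_as(q, ::Type{NamedTuple}) = collect(rowiterator(q))
    # Any struct with a positional constructor matching the columns.
    fetch_as(q, ::Type{T}) where {T} = [T(values(row)...) for row in rowiterator(q)]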

It’s critical to be able to offline-process truly big datasets. If you have a billion rows, each with a couple of tens of KB of string data, and you want to process them in batches of 1M at a time, LIMIT will require you to rewrite and re-query over and over.

Wouldn’t it be better to work towards a streaming function for those cases, especially if you decide to support server-side cursors? Though I’ll do some more planning and try to make a prototype this weekend.

Yeah, that’s what I was thinking the n was for: you could re-call fetch over and over to get the next batch, with the cursor being kept track of in the query object.
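For Postgres, a server-side cursor already gives that batch-by-batch behavior; a rough LibPQ.jl sketch (connection string, table name, and batch size are made up):

    using LibPQ

    conn = LibPQ.Connection("dbname=mydb")
    execute(conn, "BEGIN")
    execute(conn, "DECLARE big_cur CURSOR FOR SELECT * FROM big_table")
    while true
        result = execute(conn, "FETCH 1000000 FROM big_cur")
        LibPQ.num_rows(result) == 0 && break
        # process this batch of up to 1M rows...
    end
    execute(conn, "CLOSE big_cur")
    execute(conn, "COMMIT")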

I somehow missed the flood of posts that happened here last May. A few thoughts:

  • I agree with @ExpandingMan and a few others who have mentioned that Julia seems to have a fairly decent “minimal” coverage for database libraries; from a combination of ODBC.jl, JDBC.jl, SQLite.jl, MySQL.jl, and LibPQ.jl, I think we have a lot of the “basics” covered
  • Outside those packages, support can vary widely and most other wrapper libraries probably implement the pieces the author needed, but probably not much more
  • A lot of people over the years have commented on the desire/need for something like DBAPI; I’m really curious why people feel this would be so important. I mean, I get the obvious “common API for any database” answer, but when I think about it more realistically, I just don’t know if I see the value. In various workflows over the years, I’ve worked with a specific database here or another there, but rarely have I been in a situation where I literally needed to swap exact commands from one database to another; in cases where the database does need to change, it usually requires a significant refactoring anyway, so swapping out the “connect” and “query” commands is usually the easiest part (vs. table creation + column type differences, syntax changes, different supported SQL functions, etc.). So in short, I’d really be interested to hear why people think this is so important and what other value they get out of having something like DBAPI; as a contributor to a lot of database packages, I’m in a pretty good position to implement something like this, but I’ve just never really seen the value.
  • I think one of the biggest things we could do as a community to improve database reliability, robustness, and coverage is better coordination. The all-time hardest thing about maintaining ODBC.jl is the never-ending bug reports for specific database vendor issues that are almost impossible for me as a maintainer to reproduce or diagnose. Similar to the suggestion to have a central information center for database support, I think it’d be wildly helpful to know who is interested in contributing to various database vendor packages, who can help test things in various systems/setups, and where the biggest pain points are.
  • The reason I came back to this thread is that I’m gearing up for another round of updating various database packages and I wanted to remember some of the points made here. I’d love to hear where people’s biggest pain points are and what could use the most love and updates. Feel free to respond here or ping the #databases channel in slack.
17 Likes

For myself (as a user), having a DBAPI-like layer isn’t the top priority (I’m biased since I just use Postgres, though) if it takes effort that could be used to support each database’s custom features (such as Postgres COPY, LISTEN, etc.). But I think it’s an inevitable step if anyone wants to create a higher-level database library (such as an ORM). Basically, if you want to support more than one database backend, you’ll inevitably have to write a DB-agnostic wrapper, even if you don’t expose that wrapper to the end user.

The advantage of having a standard is that you don’t risk lots of different libraries implementing their own incompatible/hidden/incomplete/poorly documented version of that layer; instead, everyone can focus on a single version. And for users, having a common format is great since it makes it easier to get help online (especially for less-used databases, which will inevitably have far fewer tutorials/examples online, so users have to rely solely on the provided documentation, which is usually more technical and sometimes thin on side topics like exception/error handling).

But since that layer is already necessary for anyone creating a higher-level database library, any of those future libraries can also expose their wrapper as a separate package so others can build on it; after all, Julia’s ecosystem is about composition. And in my opinion, the most important layer is the foundation everything will be built on, followed by the top layer, the easy-to-use interface for the common user (in order of importance: Julia’s Psycopg2 > Julia’s SQLAlchemy > Julia’s DBAPI), especially since the top library will also have to connect directly to the foundation to provide access to each DB’s unique tools.

1 Like

Check https://github.com/GenieFramework/SearchLight.jl/blob/master/src/QueryBuilder.jl and SearchLight in general (docs missing / pending :frowning: ).

1 Like

Some query examples would go a long way toward understanding how to use the library; SearchLight’s documentation seems quite lacking compared to Genie’s. Skimming through it, it’s not clear to me how you add OR clauses (without embedding them in the string) for where/having when your only composition mechanism is the + operator, which seems to act as an AND (at least depending on how the final query building deals with the list). I also have to assume that no where clause means SELECT *.

In general, I’m not convinced about using + for all composition; I’d prefer a more LINQ/Ecto.Query-style layout (but without macros if possible, the tricky part being the ‘from’ variables), though that’s probably down to my previous experience. Plus there are still a lot of strings around, while query builders usually only fall back to fragments/strings for the cases they can’t reasonably cover. Of course, it’s a work in progress, and it’s hard to expect feature parity with any of those tools (especially while working on Genie as well).

Sure, there are many flavours, and to each his own.

Here’s a demo which should give an idea and answer some of your questions:
https://github.com/essenciary/SearchLight_JuliaCon18/blob/master/SearchLight%20--%20Tame%20Your%20Databases.ipynb

2 Likes

I work with ODBC.jl to interact with SQL Server at work, so I appreciate the work everyone has done on the Database libraries already.

I don’t think I’ve ever had errors or issues with the package itself. Two things I very consistently find myself doing (sketched below) are:

  1. Converting the SQL types into Base types so that I can work with them much more easily. E.g. SQLDate to Dates.Date

  2. Normalizing the column names from the query result to make them easier to work with in DataFrames
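A rough sketch of the kind of helpers I mean; the SQLDate field layout is assumed, and normalize_names! is just an illustrative name:

    using Dates, DataFrames
    import ODBC

    # 1. Convert driver-specific types to Base/stdlib types.
    to_base(x) = x
    to_base(d::ODBC.API.SQLDate) = Date(d.year, d.month, d.day)  # assumed fields
    to_base(v::AbstractVector) = map(to_base, v)

    # 2. Normalize column names: lowercase, spaces to underscores.
    function normalize_names!(df::DataFrame)
        rename!(df, [lowercase(replace(String(n), " " => "_")) for n in names(df)])
        df
    end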

I opened issues for these two things and am happy to talk more on GitHub if that’s easier, though neither is really specific to ODBC.jl alone.

1 Like