Resurrecting universal database API

Based, on the other thread, it sounds like there is some interest in resurrecting the idea of a “universal Julia database API”. There are already some great Julia database packages, and thanks to ODBC.jl and JDBC.jl at least in principle we have everything covered, but currently there is no universal way of interacting with them.

The original attempt was DBAPI.jl. The simplest suggestion would seem to be starting out by updating that package for 0.7. Some packages may even have some old code for this sitting around.

Of course, this will only matter if all of the maintainers of the major database packages are on board.

The major packages I know of that we would need to get on board are

  • ODBC.jl @quinnj
  • MySQL.jl @quinnj
  • JDBC.jl @avik (I am also an admin on this, so I can be responsible for this one)
  • LibPQ.jl @iamed2 (at least for me, this one is more and more important as time goes on as Postgres really seems to be gradually becoming the most dominant database)

I’m probably missing some, so definitely feel free to add some in if you know of others. @ScottPJones has also offered to help.

I’m willing to take charge on updating DBAPI and JDBC.

Thoughts? From the other thread it looked like everyone would be on-board, but we haven’t looped in @quinnj yet.

11 Likes

What’s the proposed API? Selfishly, I would push for it to resemble the API that ODBC/SQLite both follow; LibPQ is fairly close and I think it had a slightly nicer sink-streaming api where you just pass the INSERT statement w/ the source to stream.

Happy to chat more here or on a GitHub issue somewhere.

It doesn’t have to be anything fancy, at this point we don’t even have a uniform way to interact with all the database packages, so just having that would be a big step.

DBAPI is just a standard connection, cursor thing, it wouldn’t give you much more than a standard way of feeding in SQL strings and fetching data (we can leave fancy stuff like ORM to a later time). Following the model of ODBC, SQLite sounds fine to me. If you want to give an overview I’m sure that would be helpful.

I just had pointed out to me the https://github.com/rbt-lang/DataKnots.jl package, maybe we should also take a look at that.
For me, I think it would be good to think of having something that isn’t limited to just tabular data.
If we can manage something that can deal with NoSQL databases as well as handle all of the capabilities of SQL databases, in an easy to use and easy to extend fashion, I think we might be able to accomplish something like Chris has done for differential equations, i.e. something not available outside of Julia (and maybe not even possible without a strong type system, traits, and multiple dispatch).

BTW, I’ve taken over as maintainer of Mongo.jl & LibBSON.jl, as soon as things have settled down with the Pkg3 changes, I’ll be updating them as well for v0.7/v1.0 compatibility.

4 Likes

I work with Postgresql on a daily basis, and I think R is much better than Python in this kind of job, and the reason is that database API in R is well integrated with its data wrangling tool dplyr.

https://db.rstudio.com/dplyr/

I wish to see Julia database API be well integrated with DataFrameMeta or Query.

2 Likes

I completely agree in principle, but in practice I think the overwhelming immediate need is for tabular data (there is a tabular data cult, and it’s hard to break people out of it).

We are doing this for Julia, so I definitely agree that we should go full-tilt toward making it Julian, whatever that entails.

Anyway, I’m very interested in DataKnots.jl and the whole Rabbit Query language. It would be really great if we could just contribute to existing “professional” solutions as opposed to just shoehorning in some ad-hoc thing (this is definitely not my area of expertise, so you can be sure I won’t come up with anything incredible). Perhaps we can loop in the creators and see how realistic it would be to use it as a standard?

It’s worth considering though that even if we do wind up with something “fancy”, if it’s going to be something that just depends on passing SQL strings anyway, we can (perhaps?) start with the lowest layer.

2 Likes

I’ve opened an issue here. The discussion doesn’t necessarily have to stay in that repo, it’s just a convenient place to put it for now.

1 Like

I at the very least would like us to be thinking of these other sorts of databases (even graph databases, as well),
possibly having a basic general API, with extended APIs for tabular data, and on top of that, a SQL specific API, all compatible.

I’ve worked before on the other side of the equation (on the back-end code getting data from the database engine, to implement ODBC and JDBC, and on the driver for the client), so I know at least some of the issues.

1 Like

Is there any connection between the LibBSON and BSON packages? Looks like an opportunity for some synergy here.

I hadn’t seen the BSON package - I just took over maintenance of the packages because the previous maintainer was not able to continue supporting them, and nobody else stepped up after a long period of time to do so.

I’ll look into, at soon as I get my string stuff squared away (just waiting for a few more METADATA PRs to be merged).

I’m on board to implement any interface people decide on, in some package like LibPQDBAPI (or a resurrected PostgreSQL.jl or something). As long as I am not the primary maintainer moving forward.

This interface is largely inspired by Python’s DB API 2.0, which is in the public domain: PEP0249.
(source)

Is it still true/goal? If it is not then maybe new name would be better.

R and Python inherited DBI from Perl (AFAIK), so choosing one of these three and mimicking it exactly seems like a perfectly fine idea. And just stop there, one package doing one thing.

Getting to tidyverse levels of integration, mapping NoSQL databases onto the same relational framework, Spark integration…all these things can come after if people desire.

I absolutely will contribute to this package, as I work for a database vendor and this helps our future goals. My vote is just to continue with DBAPI.jl, implement it using SQLite.jl syntax, and get a proof-of-concept done as quick as possible.

7 Likes

That’s a good goal, in and of itself, in the short term, and harmonizing the SQL access (making sure we can still access any vendor extensions), will make any more ambitious harmonization easier.

Note: I wasn’t saying that NoSQL databases needed to be mapped onto a relational framework, but rather that we should think about how we can better access different types of databases in as consistent a fashion as possible (for hierarchical, graph, document databases [or documents embedded in a relational database - we might look at what Postgres and others have done for accessing JSON stored in the database for inspiration, also look at what InterSystems has been doing [they’ve mixed hierarchical, relational, and object models successfully for decades])

This statement doesn’t conflict with mine. But I do see them as separate problems, and something that the a DBI package doesn’t need to consider. Should the other databases use DBI as inspiration, sounds great.

Sorry if I’m missing the obvious thing, but what’s the reason to have the same (or even similar) interfaces for databases as different as PostgreSQL, MongoDB, Aerospike and Redis, for example? In my experience, client libraries are easier to use when they provide API resembling the native API of the database itself instead of implementing some “standard” interface. For example, MongoDB and its Python client use nearly the same syntax, which does not look anything like SQL.

In theory, the idea is to make it so you can swap out the backends. It doesn’t remove the need/desire for also implementing the custom features of each database, but the common 70% or whatever.

For example, connection strings, submitting a query, etc. Most databases have similar connection strings (if they support ODBC), select statements are mostly the same (if they support ANSI SQL), etc.

1 Like

I’m not talking about SQL so much, as to lower level things.
Aerospike has “AQL”, for example, submitting an AQL query could possibly be consistent with an SQL query, in the options for how Julia returns the data (with missing, tuples, vectors for columns, whatever).

So it only applies to SQL databases. More specifically, SQL databases with row-oriented storage OR column oriented storage - I can hardly imagine a case when you switch your users table from MySQL to, say, Vertica or Redshift, since even simple SELECT * FROM users WHERE id = {id} have totally different run times in them. The whole point of having that many databases is that they are optimized for different things, and you just take the one that works best for your case.

As far as I know, AQL is only represented by a command line utility and not implemented in any of the clients. Also, making any queries for more than one record by its ID is terribly slow.

Absolutely. But think about the other case, where you have been using MySQL and want to move to Vertica or Redshift. A DBI package might allow you to re-use some of the current code, but you’d get a massive performance improvement (up to the point where the common functionalities have the same code).

Also, the purpose of the package is to have a common API, not that the operations necessarily have to be the same. So the Redshift package can do things the Redshift way for performance and for specific features, but if they are common, the input API and what is returned could be common.

1 Like