Resurrecting a universal database API

Speaking for myself at least, many times this is an “unfeature” and, as a practical matter, it gets in the way. Almost always what I need is just to get some data into a tabular format, which in Julia essentially means a Dict of AbstractVectors (e.g. a DataFrame). In that case, the fact that every database is different is usually an obstacle rather than an aid. I can’t tell you how many times I just wanted to “query a database” and wound up frustrated because the underlying APIs are completely different, despite the fact that the data format they were supposed to be giving me was exactly the same. This problem is harder in Julia, in part because we have no uniform “user level” interface.

Indeed, in trying to solve my database problems in Julia I’ve dug into sqlalchemy a bit, and I can tell you that many of the annoyances and missing features of that package come down to the fact that all of the things it’s designed to interface with are completely different, so they just had to make a choice somewhere. I think we have an opportunity to do significantly better, since we already have dedicated packages for lots of databases, and we can probably do some really nice things with method extension that were impossible in Python.

If we are happy with the PEP 249 approach, that would make things extremely simple, since we could primarily just resurrect DBAPI. That wouldn’t be that hard; I could probably do it in a few hours.
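To make the PEP 249 idea concrete, here is a minimal sketch of what such an interface could look like in Julia. All names here (AbstractConnection, execute!, fetchall, the Toy* types) are hypothetical, invented for illustration; this is not the actual DBAPI.jl design.

```julia
# Hypothetical PEP 249-style interface sketch.
abstract type AbstractConnection end
abstract type AbstractCursor end

# Generic API functions that each database package would extend:
function connect end      # connect(T, args...) -> AbstractConnection
function cursor end       # cursor(conn)       -> AbstractCursor
function execute! end     # execute!(csr, sql) -> csr
function fetchall end     # fetchall(csr)      -> vector of row tuples

# A toy in-memory "database" showing how a backend would plug in:
struct ToyConnection <: AbstractConnection
    rows::Vector{Tuple}
end
struct ToyCursor <: AbstractCursor
    conn::ToyConnection
end

connect(::Type{ToyConnection}, rows) = ToyConnection(rows)
cursor(conn::ToyConnection) = ToyCursor(conn)
execute!(csr::ToyCursor, sql::AbstractString) = csr  # toy backend ignores the SQL
fetchall(csr::ToyCursor) = csr.conn.rows

conn = connect(ToyConnection, [(1, "a"), (2, "b")])
csr = execute!(cursor(conn), "SELECT * FROM t")
fetchall(csr)  # -> [(1, "a"), (2, "b")]
```

The point is that user code only ever touches the four generic functions; each database package supplies methods for its own concrete types.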

I agree in principle, but as a practical matter I just don’t know what to do right now to support lots of different types of databases, so you’d need to give us lots of input for that to have much hope, I think. If we go the PEP 249 route, we can have something very soon.

Before we get too ahead of ourselves, it’s also worth thinking about whether there are any existing efforts that might be happy to have us embrace them. @ScottPJones pointed out Rabbit Query Language, which looks interesting, and they certainly seem to have done a lot of work. It’s probably a somewhat different use case, but for all I know there are lower-level parts of it that the databases could easily plug into. I don’t think I’ve ever spoken to its maintainers. Are there any other existing efforts that anyone knows of? I seem to remember one other that had some crazy way of using vcat to mimic SQL in Julia syntax.

I forgot what I really wanted to say: this never, ever works for me in real life, much as I wish I could do things that way. Sometimes, even if it’s a database that I’m the primary user of, if I ask for it to be a particular type of database the response is basically “go screw yourself”. I’m sure lots of people are in the same scenario.


In particular, I think that Julia’s multiple dispatch, types, and traits could (even just for tabular and relational database access) give us a lot more commonality and ease of use, while still allowing the full power of all the extensions that one or more databases might support.

I don’t want just a lowest-common-denominator approach, but rather a way of accessing the same functionality where it is present on different databases. If I don’t use anything more than saving/loading simple tables, the backend I use could be pretty much anything (i.e. an array could be serialized [not using Julia’s serializer!] and then stored on Aerospike or any other KV store, or stored in a hierarchical DB in either row-major or column-major form, or as JSON or CSV, or in SQLite or any other relational DB).
If I want to use, say, ANSI SQL, then there should be a consistent interface for that, etc.

Hopefully this isn’t just a pipe dream!
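The traits idea mentioned above can be sketched with the usual Holy-traits pattern. Everything here is hypothetical (the trait names, SQLiteConn, AerospikeConn, runquery); it just shows how a backend could advertise which capabilities it supports, so that generic code dispatches on the capability rather than the concrete backend.

```julia
# Capability traits for database backends (all names invented for illustration).
abstract type SQLSupport end
struct SupportsSQL <: SQLSupport end
struct KeyValueOnly <: SQLSupport end

# Stand-ins for real connection types from backend packages:
struct SQLiteConn end
struct AerospikeConn end

# Each backend package declares its trait:
sqlsupport(::Type{SQLiteConn}) = SupportsSQL()
sqlsupport(::Type{AerospikeConn}) = KeyValueOnly()

# Generic code dispatches on the trait, not the backend:
runquery(conn::T, q) where {T} = runquery(sqlsupport(T), conn, q)
runquery(::SupportsSQL, conn, q) = "executing SQL: $q"
runquery(::KeyValueOnly, conn, q) = error("this backend only supports get/put by key")

runquery(SQLiteConn(), "SELECT 1")      # works
# runquery(AerospikeConn(), "SELECT 1") # would throw, by design
```

The nice property is that a backend that grows SQL support later only has to change its trait declaration; no user code changes.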


As I said, I feel that may be a great first step, and sorely needed.
I just want to raise awareness that we should at least look at what else is out there, what solutions already exist and their pros and cons, and keep them in mind when making decisions as a group on the Julia DB API.


Ok, let’s take it to the extreme: what’s the common subset of efficient operations supported by both Aerospike and Vertica?

Aerospike is a fast key-value database with the possibility of data eviction when disk usage reaches a high watermark. It’s blazing fast for inserting and retrieving individual records by their ID and terribly slow for any kind of multi-record query. It also has a number of limitations, such as a maximum bin (field) name length (14 bytes) and a maximum record size (128K).

Vertica is a columnar database designed for batch processing and analytical queries. It implements efficient data aggregation, but individual inserts and fetches are painfully slow.

Aerospike and Vertica have essentially no overlap in their sets of (efficient) operations, so it’s extremely unlikely that you would ever switch between them or even consider them in the same context.

My point is that you may have a better time designing a good common API for SQL databases, so that a user could:

  1. switch between equivalent DBs (e.g. MySQL <-> PostgreSQL or Vertica <-> Redshift)
  2. have a well-known interface integrated with the rest of Julia infrastructure.

This is what ODBC and JDBC successfully implemented, so there’s no reason we can’t do it as well. But in the NoSQL world things are really, really diverse, so I’d suggest putting that aside and implementing separate packages for each NoSQL DB.
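Point 1 above (switching between equivalent DBs) is worth sketching, since it falls out of multiple dispatch almost for free. Everything below is a toy: `connect`/`query` are assumed generic API functions, and `ToyA`/`ToyB` stand in for, say, MySQL and PostgreSQL connection types.

```julia
# Hypothetical sketch: backend-agnostic user code via a common generic API.
function connect end
function query end

struct ToyA end   # pretend this is a MySQL connection type
struct ToyB end   # pretend this is a PostgreSQL connection type

connect(::Type{ToyA}, dsn) = ToyA()
connect(::Type{ToyB}, dsn) = ToyB()
query(::ToyA, sql) = "A ran: $sql"
query(::ToyB, sql) = "B ran: $sql"

# User code depends only on the generic functions; migrating to another
# equivalent database is a one-line change of the type parameter:
report(::Type{C}, dsn) where {C} = query(connect(C, dsn), "SELECT 1")

report(ToyA, "a://host")  # "A ran: SELECT 1"
report(ToyB, "b://host")  # "B ran: SELECT 1"
```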


Yes, this is the proposal, and what DBAPI.jl (GitHub: JuliaDatabases/DBAPI.jl, “a new database interface proposal”) was originally intended for.

:+1:


I think @randyzwitch got it exactly right: it would be great to have a simple package that handles the SQL case. All the other stuff can be parallel or on top of that.

Some folks mentioned query translation from a higher-level query language. Query.jl is designed to do that; the implementation is just super, super incomplete (essentially one can run exactly one very specific query and have it translated to SQL; nothing else works).

Indeed, one of the strengths of the tidyverse is that, as a user, it lets you do database things in a very abstract way. JuliaDB comes close (and will eventually be there), but not quite.

I think this provides good inspiration as to what the “universal” API might look like from a user perspective.


Ok, I’ve done some digging around and investigating.

I’ve done some very basic upgrading of DBAPI to 0.7 to make it a bit more easily readable; you can see this here. Also, here you can see what I did when I modernized JDBC, which was based on the previously existing DBAPI interface.

Looking around at the various existing database interfaces, there are basically 3 objects:

  • Connection for managing the connection. This seems to be the simplest and most universal. Rarely do I ever see much in the way of function calls on it.
  • Cursor, which basically seems to track some sort of internal state of the query, pretty much as described in PEP 249. This often gets packaged or mixed up with actual queries, often simply because it isn’t needed as a separate object.
  • Source (or Sink), which can act as the Data.Source (or Data.Sink) and is basically a fully specified query, such that the tabular structure of the data to be fetched (or pushed) is fully determined.

I imagine API wrappers looking something like

DBAPI.connect(cnxn::PackageConnection) = package_connect(cnxn)
DBAPI.query(csr::PackageCursor, q::AbstractString) = package_query(csr, q) # returns PackageSource

et cetera, much as they have looked in the past. It would be most unobtrusive if we did not require package objects to inherit from DBAPI abstract types; I’m still contemplating whether this is feasible (I think it is, because DataStreams can work that way).
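The "no required inheritance" idea can be demonstrated with a toy: the glue between a DBAPI-style module and a database package is just method extension on the package's own concrete types, which never need to subtype anything. The module name `DBAPI` here is reused from the thread, but `FakeConn`/`FakeResult` and the function bodies are invented for illustration.

```julia
# Sketch: a DBAPI-style module that only declares generic functions.
module DBAPI
    function connect end
    function query end
end

# Imagine these live in a database package; note they do NOT subtype
# anything from the DBAPI module:
struct FakeConn
    dsn::String
end
struct FakeResult
    sql::String
end

# The package extends the generic functions for its own types:
DBAPI.connect(::Type{FakeConn}, dsn) = FakeConn(dsn)
DBAPI.query(conn::FakeConn, sql::AbstractString) = FakeResult(sql)

res = DBAPI.query(DBAPI.connect(FakeConn, "fake://db"), "SELECT 1")
res.sql  # "SELECT 1"
```

This works because Julia dispatches on the concrete argument types; abstract supertypes are a convenience for shared fallbacks, not a requirement for participation.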

One major obstacle (or at least unpleasantness) I foresee is that for the DBAPI to really be useful it should be integrated with DataStreams, but existing database packages already have their own DataStreams interfaces, so it might be extremely awkward for these to coexist.

Any thoughts? I’m going to try to do a mock-up of something that I think would work for both LibPQ and JDBC as a starting point.


Hello. I think that something like ODBC/DBAPI is important to have, with a conceptually simple function: a universal adapter to tabular data sources. I say tabular because you need some assumptions about the resulting data structure, and tabular structures are state-of-the-art and the 90% case. Having MySQL, PostgreSQL, and SQLite easily usable out of the box should be the goal.

I don’t think DataKnots, an implementation of Query Combinators, is relevant to this need. DataKnots is an algebra that you could extend to construct domain-specific query languages. Currently, DataKnots lacks documentation. However, I’ve collected a few links to the general Query Combinator approach at https://querycombinators.org, including the first few sections of a forthcoming paper, Query Combinators for Clinical Research. That page also has information about a precursor, HTSQL. The first release of DataKnots will support in-memory structures and computations. Later releases, funding permitting, will extend it to federate queries (when possible) into SQL so that one could integrate data sets across a great many existing data sources. Hence, in this regard, even DataKnots.jl will need something like DBAPI.

Anyway, I’ll be at JuliaCon with a talk on DataKnots. With any luck, I’ll have documentation and Kyrylo will have enough to call it an alpha release. Thank you so much for your enthusiasm. It’s a pretty cool project and we’re happy to share our thoughts and efforts. Julia is truly a unique programming environment and one that is particularly well suited to this kind of data processing.


Thanks for your feedback

Yeah, that’s pretty much what I thought was the case from looking at the source code. I was just sort of grasping at things that looked like “universal” user interfaces to multiple back-ends. Regardless, looking forward to getting everything working together seamlessly! Hopefully we can be useful for letting you plug DataKnots into existing databases if that’s indeed your goal.

Important to a DBAPI is that fetching data from a database is a stream, not a completed data frame. This is necessary for several reasons: (a) the result set might be very big (or effectively infinite) and only needs to be processed in small chunks; (b) the target structure might not be a DataFrame, in which case you wouldn’t want intermediate temporary data. Of course, there should be easy ways to convert this stream into a DataFrame.
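The streaming point can be illustrated without any database at all: if results arrive as an iterator of rows (here a bounded Channel stands in for a real cursor), consumers process them in constant memory, and materializing everything is an explicit, separate step. All names below are invented for illustration.

```julia
# A Channel pretending to be a database cursor that yields rows lazily,
# buffering at most 16 rows at a time:
fake_resultset(n) = Channel{NamedTuple}(16) do ch
    for i in 1:n
        put!(ch, (id = i, name = "row$i"))  # pretend rows trickle in from the DB
    end
end

# Chunk-by-chunk processing; the full result set is never held in memory:
function sum_ids(rows)
    total = 0
    for row in rows
        total += row.id
    end
    return total
end

sum_ids(fake_resultset(1000))   # 500500, with at most 16 rows buffered
collect(fake_resultset(3))      # materialize into a Vector only on request
```

Converting such a stream into a DataFrame would then just be one more sink on top of the iterator, rather than the mandatory default.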


Yes, we will be using DataStreams.jl for that.

@cce You might also want to take a look at Query.jl and QueryOperators.jl (essentially an implementation of LINQ for Julia). The underlying abstraction that ties things together there is an iterator-of-named-tuples interface defined in TableTraits.jl, with lots of source and sink implementations, partly in various table packages and partly in IterableTables.jl.
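The "iterator of named tuples" abstraction is easy to demonstrate in plain Julia: any such iterator can be consumed generically, and turning it into columns is mechanical. This is toy code to illustrate the idea; TableTraits.jl defines the formal protocol around it, and `columns` here is a made-up helper, not part of that package.

```julia
# A table as a plain iterator of named tuples:
people = [(name = "ada", age = 36), (name = "alan", age = 41)]

# A generic "sink": build a Dict of column vectors from any iterator of
# named tuples sharing the same field names.
function columns(rows)
    first_row = first(rows)
    cols = Dict(k => [getfield(first_row, k)] for k in keys(first_row))
    for row in Iterators.drop(rows, 1)
        for k in keys(row)
            push!(cols[k], getfield(row, k))
        end
    end
    return cols
end

cols = columns(people)
# cols[:name] == ["ada", "alan"], cols[:age] == [36, 41]
```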

David, thank you. Yes, we’ve looked at both of those; they are excellent libraries for processing tabular data sets. We’re coming at things from a perhaps less common problem: we have heterogeneous medical data that are often federated, with quite complex relationships. Cheers! - Clark

Neither LINQ nor Query.jl is restricted to tabular data. Both support any iterator of anything, including hierarchical structures (a la the LINQ-to-XML story). I’m still trying to wrap my head around how your design differs :slight_smile:

Hehe, I know exactly what you mean (but from the systems end of it, supporting object and relational models on top of a B+ tree hierarchical database, which you may have dealt with: Caché).

Last year I struggled a lot trying to query a PostgreSQL database using native Julia APIs, and only partially succeeded using Cxx.jl by calling C++ code that used the libpqxx library. I had no problem attaching to the database and getting simple strings back into Julia, but I could NOT get multiple rows or mixed-data-type returns to unpack properly into Julia. Lots of mucking about with “unsafe_string(pointer(connection))”. Please, please make this easier!! :slight_smile:

I think what is needed to start with is a static (singleton) connection, so that once it’s made it can be reused. Multiple DB connections are to be avoided if possible (IMHO).
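One simple way to get that "make it once, reuse it" behavior is to memoize connections per connection string. Everything below is a hypothetical sketch: `opener` stands in for a real constructor such as a LibPQ connection, it is injected as an argument purely so the sketch is testable, and a real version would also need to check that a cached connection is still alive.

```julia
# Cache of live connections, keyed by connection string:
const _CONNECTIONS = Dict{String, Any}()

# Return the cached connection for `dsn`, opening one only on first use:
function get_connection(opener, dsn::String)
    get!(_CONNECTIONS, dsn) do
        opener(dsn)   # only called on a cache miss
    end
end

# Demonstration with a fake opener that counts how often it runs:
opens = Ref(0)
fake_open(dsn) = (opens[] += 1; "conn to $dsn")

c1 = get_connection(fake_open, "postgres://localhost/db")
c2 = get_connection(fake_open, "postgres://localhost/db")
opens[]  # 1 -- the second call reused the cached connection
```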

Fortunately, you no longer have to go through any of that, thanks to LibPQ. This thread is more about having a universal interface that’s compatible with all database back-ends currently available in Julia. It might take a while to work out how best to do this, but writing the code itself should still be relatively quick, and I still have every intention of doing it.

That will still depend on the back-end. With most of them you shouldn’t need to re-create a connection to the same database, but different databases always require new connections.

This will also be useful, I think, for the place I just started consulting for (in Julia, I’m so happy!!!)
Even if I have to do this in my own non-billable time, I’m up for contributing as much as possible; it’s sorely needed.
