Resurrecting universal database API

The problem I’m having is that it’s looking like there will be a lot of awkwardness because DataStreams interfaces are already implemented in all of the existing database API’s. It’s sort of as if the universal DBAPI has already been implemented halfway on all of them, so enforcing consistency on the remaining parts might look weird. Ideally we’d make PR’s to each of the existing database packages to expand their DataStreams interface into one that includes a bare minimum of database interaction, but more realistically it’s looking like we’ll have to awkwardly layer our DBAPI on top of what’s already there in the hope that at some point down the road we can clean it all up.

Will you be at JuliaCon? Would be a good time to get a “birds of a feather” group together, if so.

Unfortunately no. But we should connect on Slack, and probably get more of @quinnj’s input especially once we have a better idea of just what needs to be done.

As a longtime MATLAB and R user, and as a new Julia user, I wanted to just add database support (including reading and writing BLOBS efficiently) is essential.

I wanted to switch to Julia for a new project, but the current lack of BLOB support has forced me to reconsider.

I don’t think I have the skills to write the code, but I would be very happy to help with testing and documentation.

Pardon my ignorance but what is a BLOB? (Also, it’s alsways a bit odd when someone feels that something I’ve never heard of is absolutely necessary.)

1 Like

Binary Large OBject - very often this is how images are stored in a database.

I’ve never used BLOBs, but they are part of the ANSI SQL standard (2011 maybe?). Given I’ve never used them, I’m not sure I’d say they are any more important to support than any other data type, but they also don’t seem like they’d be any different than any other data type either.

In academia, HDF5 is a popular storage tool for researchers. When combined with git, either for version control or reproducibility[1], there is a trade-off - git does not diff a BLOB if the researcher changes the data s/he stored. ATM, am unaware of any tool for doing diff/merge style operations on binary data, HDF5[2] or not.

[1] http://www.activepapers.org/
[2] OTOH, HDF5 is probably not a good diff candidate, but well…

1 Like

Databases often have CLOB and BLOB types.
CLOB is Character Large OBject, and you need to know what the encoding is for this.
BLOB is, as @js135005 replied, a Binary Large OBject, essentially a Vector{UInt8}.
This is one reason I’ve always wanted a binary string type in Julia (and now have BinaryStr in my package).
If Julia stored the .ji files in a nice distributed persistent database, they’d be stored as BLOBs.

Not to start a debate again, but how is that different than String which can already hold arbitrary binary data? It seems to me that you either give the data a stringlike interpretation to the extent that you can, in which case for a UTF-8 interpretation, you get something very much like String, or you don’t give it a stringlike interpretation, in which case it seems like you may as well just use Vector{UInt8}.

Databases need to distinguish between text values (i.e. which need to have an encoding associated with them - the world is not all UTF-8, esp. with the stored data, as opposed how the text is output on the web, and also may need to deal with CR/LF vs NL conversions), and binary data (no conversions ever).

Also, frequently binary file formats have embedded text data, so you need to be able to take those parts and perform string operations on them.

String makes it rather difficult to keep those distinct, and doesn’t provide much/any help to the programmer trying to make sure they are not getting garbage in.

So you’re talking about something that doesn’t treat the data in a stringlike fashion at all? How is that not just a vector of bytes? Or something that treats some of the data in a stringlike fashion in a structured way? That seems like something far more than a string type.

Very simply, you can’t do things like take a SubString of a Vector{UInt8}, so you’d be left with the horrible performance of copying every embedded string.

You can take a view slice of a byte vector.

That just gives you a slice of a byte vector, not a string, and certainly can’t handle having different encodings.

As an example, a BLOB that contains both ASCII/Latin1 or UTF-8 strings embedded, likely with a 1, 2, or 4 byte length field preceding, and 1 byte code units, as well as UTF-16 strings, aligned, with a 2 or 4 byte length field preceding.
(which can be a good way of packing strings into a binary module, to use up less space for cases where UTF-8 blows up [i.e., most non-Western European languages])

While these can be legitimate use cases, I think that at this stage this just derails a very important discussion.

2 Likes

Well, all databases I’ve dealt with over the years have been careful about a distinction between binary strings and text strings.
A “universal” API that couldn’t handle that would not be very universal (and IMO, not very useful in many/most of the use cases our customers had when I was at InterSystems).

At least the way I envisioned this, that sort of stuff would be handled by DataStreams, which it is already fully capable of doing.

What I wanted here was a way to

  • Create connection objects with connection strings with an interface which is independent of the SQL back-end.
  • Execute SQL query strings with an interface which is independent of the SQL back-end.
  • Have a uniform interface for downloading tabular data using Data.stream! regardless of the SQL back-end.
  • Have a uniform interface for uploading tabular data using Data.stream! regardless of the SQL back-end.

So, my intention is to take over the (defunct) package and implement the above in the PEP249 style, as was the original intention. Therefore, much like Steve Jobs I won’t have to do any actual work because it was already done for me by Steve Wozniak (Jacob Quinn).

Honestly, one the thing that has stopped me is that, fortunately, I’m increasingly only having to use PostgreSQL, so my former horrible situation in which I had to interact with several completely different database interfaces has been much alleviated. I’ll still need it eventually though, so I still volunteer to do it.

1 Like

As long as the universal API doesn’t prevent you from using those types from DataStreams, there’s no problem.

Its history is not without complication but it is fairly standard in database lingo. Even if the interpretation binary large object explanation is post creation. :blush: