Resurrecting universal database API

ExpandingMan · June 11, 2018, 2:03pm

The problem I’m having is that it’s looking like there will be a lot of awkwardness because DataStreams interfaces are already implemented in all of the existing database API’s. It’s sort of as if the universal DBAPI has already been implemented halfway on all of them, so enforcing consistency on the remaining parts might look weird. Ideally we’d make PR’s to each of the existing database packages to expand their DataStreams interface into one that includes a bare minimum of database interaction, but more realistically it’s looking like we’ll have to awkwardly layer our DBAPI on top of what’s already there in the hope that at some point down the road we can clean it all up.

ScottPJones · June 11, 2018, 2:08pm

Will you be at JuliaCon? Would be a good time to get a “birds of a feather” group together, if so.

ExpandingMan · June 11, 2018, 2:11pm

Unfortunately no. But we should connect on Slack, and probably get more of @quinnj’s input especially once we have a better idea of just what needs to be done.

jerlich · August 13, 2018, 12:07am

As a longtime MATLAB and R user, and as a new Julia user, I wanted to just add database support (including reading and writing BLOBS efficiently) is essential.

I wanted to switch to Julia for a new project, but the current lack of BLOB support has forced me to reconsider.

I don’t think I have the skills to write the code, but I would be very happy to help with testing and documentation.

StefanKarpinski · August 13, 2018, 12:13am

Pardon my ignorance but what is a BLOB? (Also, it’s alsways a bit odd when someone feels that something I’ve never heard of is absolutely necessary.)

js135005 · August 13, 2018, 12:17am

Binary Large OBject - very often this is how images are stored in a database.

randyzwitch · August 13, 2018, 1:56am

I’ve never used BLOBs, but they are part of the ANSI SQL standard (2011 maybe?). Given I’ve never used them, I’m not sure I’d say they are any more important to support than any other data type, but they also don’t seem like they’d be any different than any other data type either.

svaksha · August 13, 2018, 8:00am

In academia, HDF5 is a popular storage tool for researchers. When combined with git, either for version control or reproducibility[1], there is a trade-off - git does not diff a BLOB if the researcher changes the data s/he stored. ATM, am unaware of any tool for doing diff/merge style operations on binary data, HDF5[2] or not.

[1] http://www.activepapers.org/
[2] OTOH, HDF5 is probably not a good diff candidate, but well…

ScottPJones · August 13, 2018, 8:20am

Databases often have CLOB and BLOB types.
CLOB is Character Large OBject, and you need to know what the encoding is for this.
BLOB is, as @js135005 replied, a Binary Large OBject, essentially a Vector{UInt8}.
This is one reason I’ve always wanted a binary string type in Julia (and now have BinaryStr in my package).
If Julia stored the .ji files in a nice distributed persistent database, they’d be stored as BLOBs.

StefanKarpinski · August 13, 2018, 2:22pm

Not to start a debate again, but how is that different than String which can already hold arbitrary binary data? It seems to me that you either give the data a stringlike interpretation to the extent that you can, in which case for a UTF-8 interpretation, you get something very much like String, or you don’t give it a stringlike interpretation, in which case it seems like you may as well just use Vector{UInt8}.

ScottPJones · August 13, 2018, 2:27pm

Databases need to distinguish between text values (i.e. which need to have an encoding associated with them - the world is not all UTF-8, esp. with the stored data, as opposed how the text is output on the web, and also may need to deal with CR/LF vs NL conversions), and binary data (no conversions ever).

Also, frequently binary file formats have embedded text data, so you need to be able to take those parts and perform string operations on them.

String makes it rather difficult to keep those distinct, and doesn’t provide much/any help to the programmer trying to make sure they are not getting garbage in.

StefanKarpinski · August 13, 2018, 2:36pm

So you’re talking about something that doesn’t treat the data in a stringlike fashion at all? How is that not just a vector of bytes? Or something that treats some of the data in a stringlike fashion in a structured way? That seems like something far more than a string type.

ScottPJones · August 13, 2018, 2:38pm

Very simply, you can’t do things like take a SubString of a Vector{UInt8}, so you’d be left with the horrible performance of copying every embedded string.

StefanKarpinski · August 13, 2018, 2:39pm

You can take a view slice of a byte vector.

ScottPJones · August 13, 2018, 2:40pm

That just gives you a slice of a byte vector, not a string, and certainly can’t handle having different encodings.

As an example, a BLOB that contains both ASCII/Latin1 or UTF-8 strings embedded, likely with a 1, 2, or 4 byte length field preceding, and 1 byte code units, as well as UTF-16 strings, aligned, with a 2 or 4 byte length field preceding.
(which can be a good way of packing strings into a binary module, to use up less space for cases where UTF-8 blows up [i.e., most non-Western European languages])

Tamas_Papp · August 13, 2018, 3:15pm

While these can be legitimate use cases, I think that at this stage this just derails a very important discussion.

ScottPJones · August 13, 2018, 3:34pm

Well, all databases I’ve dealt with over the years have been careful about a distinction between binary strings and text strings.
A “universal” API that couldn’t handle that would not be very universal (and IMO, not very useful in many/most of the use cases our customers had when I was at InterSystems).

ExpandingMan · August 13, 2018, 3:53pm

At least the way I envisioned this, that sort of stuff would be handled by DataStreams, which it is already fully capable of doing.

What I wanted here was a way to

Create connection objects with connection strings with an interface which is independent of the SQL back-end.
Execute SQL query strings with an interface which is independent of the SQL back-end.
Have a uniform interface for downloading tabular data using Data.stream! regardless of the SQL back-end.
Have a uniform interface for uploading tabular data using Data.stream! regardless of the SQL back-end.

So, my intention is to take over the (defunct) package and implement the above in the PEP249 style, as was the original intention. Therefore, much like Steve Jobs I won’t have to do any actual work because it was already done for me by Steve Wozniak (Jacob Quinn).

Honestly, one the thing that has stopped me is that, fortunately, I’m increasingly only having to use PostgreSQL, so my former horrible situation in which I had to interact with several completely different database interfaces has been much alleviated. I’ll still need it eventually though, so I still volunteer to do it.

ScottPJones · August 13, 2018, 4:18pm

As long as the universal API doesn’t prevent you from using those types from DataStreams, there’s no problem.

DoktorMike · August 13, 2018, 9:52pm

Its history is not without complication but it is fairly standard in database lingo. Even if the interpretation binary large object explanation is post creation.

Topic		Replies	Views
A prototype for a minimalistic universal database API Data	4	704	August 20, 2018
SQLAlchemy.jl Resurrected Data	2	1319	October 3, 2017
PostgreSQL.jl no longer works in v0.6 Web Stack	27	3481	March 26, 2018
Recommendations for ODBC Data	6	1040	January 5, 2017
[ANN] JDBC.jl Rejuvenated Data announcement , dataframes	3	1221	March 20, 2018

Resurrecting universal database API

Related topics