Status of Julia database libraries

Agreed. This is a chicken-egg scenario. R has their DBI, which allows packages like dbplyr to exist, and Python has Ibis (which is a bit less successful). Since there really isn’t a 500-lb gorilla in the Julia community yet like RStudio, the commonality is less important because no one is trying to catch up to an instant eco-system.

FWIW, as someone who is thinking about making a DBI-compliant R package, it sucks. Being DBI compliant for Python isn’t much better (just to the extent that Python is somewhat easier than R to develop packages). So without a really compelling use case, I don’t see Julia having a DBI-like package any time soon

1 Like

The value of it depends a lot on the circumstances in which you’ve landed yourself. At some point, in my job, there was a whole mess of different databases (or query interfaces at least) which I might have to use at any given time: Postgres, MS SQL, presto, AWS Athena. There were indeed times when I would be asked to “swap” between them. In this sort of environment the DBAPI felt important: if a different database is to be used, the interface has to change, usually in a whole bunch of different places all at once. I still believe that a general interface would be a good thing to have. Granted, it’ll be more valuable to some than for others.

There was a time when this was highly relevant to me and I had every intention of taking this on myself. I was waiting for the Julia data ecosystem to mature somewhat, but by the time that happened, my own need for this was dreceasing drastically (I was working on a major project that was going to be all Postgres and another that was just fetching data directly from REST API’s on demand).

Right now I’m finding myself in the position of needing more new ways of fetching data. At the moment my colleagues seem increasingly eager to use Databricks for everything (Databricks is something I’m extremely skeptical about for all sorts of reasons). I don’t know what the future of interacting with that will look like for me, but it’s likely to remain true that I will not have the need for a DBAPI that I would have had two years ago.

My biggest concern by far right now is that we have effectively lost the JDBC wrapper until Julia 2.0. JDBC was a really nice fallback because you can always be confident that a driver would be available for it. Beyond that, at least where I am I see an increasing reliance on Spark, which suggests that I might increasingly have to pull in the entire JVM as a dependency in any case, so knowing that you can essentially interact with both Spark and databases in a straightforward way through JavaCall in Julia was highly valuable, and it’s all gone for the foreseeable future. Lately I’ve been contemplating how much I’ll need to use SQLAlchemy through PyCall.jl as an alternative.

(NOTE: @StefanKarpinski has indicated that it is not necessarily true that this issue can have no satisfactory resolution until 2.0, so please ignore that particular implication in my text above.)

2 Likes

On DBAPI abstraction, I’m not pushing anyone, but it seems like a simple thing? Just make at least one function “query” generic (more?) would go a long way. I’m not sure, can a macro help? And more importantly, because of e.g. JDBC issue, could the abstraction, not only abstract away the database type, but also make it easy to switch between JDBC.jl, SQLAlchemy through Python, ODBC.jl (or is that the only point of this package?) and native drivers?

I quoted this in my latest discourse thread, and asked about a (possibly) more general issue this JavaCall/JDBC issue.

You probably know best about JDBC.jl, but it’s not really lost, if you are ok with using the long-term support (LTS) versions of Julia, 1.0.x; or if you wait for at least 1.4, as 1.3 is heading for feature freeze, and release of 1.2 (and 1.3) on about 10th of August.

It’s not likely that you have to wait for Julia 2.0 at least.

I’ve not looked into this obscure PyCall issue that affects JDBC, Spark/Scala and e.g. GitHub - aviks/Taro.jl: Read and write Excel, Word and PDF documents in Julia but I understand it to be on the updated JVM side and can you just use an older JVM?

My understanding is that until 2.0 one will require a custom-compiled version of Julia. At 2.0, assuming somebody does the PR for it, there will be an additional option for ccall which makes it possible to call the JVM without conflict. If I’m wrong about this, please somebody correct me.

Since I’m much more actively involved with Julia than anything else, for me it would be extremely painful to have to use the LTS version of Julia. It would probably already break some of my code. I will custom compile if I absolutely have to, but where I work our deployment process is already horribly broken, so I’m avoiding this as long as I can.

Where in the world did the idea that this has to wait until 2.0 come from? None of this has anything to do with Julia’s API. Adding an option to ccall is not even a breaking change.

5 Likes

I’m so confused, this is how I interpreted the discussion in the issues. Could you enlighten us with a comprehensive description of what can happen here?

Thread title changed at OP’s request.

Julia 2.0 is mentioned nowhere. Also being discussed over here:

The only place that I can see anyone talk about “we have effectively lost the JDBC wrapper until Julia 2.0” is your post that said that. Where are you getting this idea from?

Clearly I thought that someone was implying that. I’d be happy to delete any posts I made to that effect if you think this really is wrong. I’m pretty sure there is another Julia issue that discusses it somewhere, but I’m not sure which.

Maybe edit your post to indicate that this isn’t the case. I talked with Jeff, Jameson and Keno earlier today and they agree that there’s no reason this would be impossible before 2.0. Nor would 2.0 somehow make it possible without modifying the JVM.

I’ll do that. Just so I’m less confused in general, what exactly would the plan be here? Is it indeed the case that what’s needed is an alternate ccall or an option to ccall and that could be introduced in 1.4?

I think it goes without saying that it’ll always be somewhat problematic if either the JVM, Julia or both always require custom compilations for this to work, though I’m not necessarily claiming there is a way around that.

(I’m feeling particularly feckless and useless here as this is something that’s potentially of really major consequence to me, but I’m totally out of my depth on the actual problem, so I’m likely to contribute absolutely nothing to the whole issue rather than standing around feeling upset :disappointed:. Might be a good reason to get more educated on LLVM and threading.)

I’m reading the same thread that you are, the situation is laid out there better than I could explain it.

1 Like

@ExpandingMan I re-read that issue and see it had 1.3 milestone than 1.4 milestone and now it is without any. I am just curious if you dig up more information about this problem.

You are also writing:

From wiki: … ( ODBC ) is a standard application programming interface (API) for accessing database management systems (DBMS) .The designers of ODBC aimed to make it independent of database systems and operating systems.

I think same or similar motivation could be written for ODBC than for julia DBAPI.

And you could also try to search how python DBAPI is useful.

Think for example panda’s people (see read_sql)! They don’t need to maintain different DB interfaces for people who wants to read data from DB into Dataframe.

Their users (data scientists) usualy don’t need to refactor select * from data_table if they have need to analyze data stored in PostgreSQL or sqlite or whatever DB. I am talking about premature optimization in problems where loading data from DB is just non significant part of task.

I am only am occasional database user, so pardon the obvious ignorance, but with ODBC already a standard API and already available in Julia, what would DBAPI.jl add? Support for databases that choose not to follow this standard? Which ones would these be?

As I said, I don’t use these too often, so this is not an attempt to toss a pebble in the pond. I’m just curious.

Look what ODBC.jl say:

The ODBC.jl package provides high-level julia functionality over the low-level ODBC API middleware. In particular, the package allows making connections with any database that has a valid ODBC driver, sending SQL queries to those databases, and streaming the results into a variety of data sinks.

So you already have/need some julia way how to access and work on DB.

You have three options:
a) make some standard protocol (simplest is borrow or get inspired by some useful one)
b) keep liberty for package developers to guess or win some de facto standard
c) ignore standardization and enjoy possibility that everything could be optimized (or screwed) in different way

ODBC.jl is probably (who knows?) using DBInterface where as I see now quinnj made this year this commit. So maybe we are in b) situation now?

I prefer a) but I am not complaining that we are not there! It needs some kind of maturity of language and community. I was just trying to answer question why it is useful to have (some day) good julia standard DB API.

About your question - you need to install and configure ODBC drivers and environment (and also maintain it then) to use it. Some ODBC driver are not free (which could or could not be a problem).

For example think about people who are using python + DB client + DBAPI package. Now if they are thinking about trying julia - they have additional complication. (They need to install and configure ODBC drivers or install pycall and try to glue it with python dbapi package or download data from DB to file using python , … ).

Think again about pandas developers. They could say: “Hey put SQL statement and DBAPI object into our function and it will work” instead of “Please forget about your favorite DBAPI package install and configure ODBC driver and you could use it in our function”.


Look at doc for ODBC.jl function load is without asterisk load. An doc for LibPQ.jl function load is with asterisk load!. You could see that if you start with ODBC (which seems to have pretty simplistic API - you really could not close connection to DB??) and wants to port program to using LibPQ.jl (not only because you need to clean memory from unnecessary connections) you have compatibility problems.

BTW I am pretty confused that load or load! function is used to save (!) data into database. (ODBC.jl doc is saying: “… it allows one to send data to a DB.”. (load and send are quite oposite for me but it could be just me! :slight_smile: )) LibPQ.jl doc is going from my POV more confusing (for me!) way by describe this function LibPQ.load!(table, connection::LibPQ.Connection, query) where query parameter is insert statement!

I don’t want to dishonest people who create packages and docs!!! I just like to show example where maybe some kind of constructive cooperation could be useful. Useful not to discourage some from using julia and help others to have best experience.

1 Like

@Liso, the issue with the JVM has been resolved by providing the environment veriable JULIA_COPY_STACKS. If this is set to 1 the JVM should work, at a small performance penalty for multi-threaded routines. You can indeed use JavaCall.jl and JDBC.jl now on 1.3 on. I believe there was some discussion about whether a slightly different C calling convention to make this work better, but I’m not sure if there is still an issue up for that.

3 Likes

Sorry to revive this old post, but is JDBC.jl currently the way to go for using databases from Julia?

1 Like

Depends on the Database, and your environment.

For many common databases, such as Mysql, Postgres, SQLite, there are specific drivers. For less common databases, you can use ODBC.jl or JDBC.jl based on your familiraty. If a JVM is already installed in your environment, JDBC.jl is pretty easy to use.

4 Likes

I found this thread while doing some research regarding how well (or not well?) Julia supports interfacing with postgres.

It’s been a number of years and it looks like this thread went quiet. Does that mean support was significantly improved? I found this package called libpg.jl, but perhaps there’s something else which is suggested to use?

It looks like there are also two other libraries.

I’m not sure what the differences between these things are. If someone happens to be familiar and can tell me that would be much appreciated. Will try to figure it out of course…

btw I saw some strange comments about SQL databases not being widely used

  • Back in 2018 I was working in a research science field on a nuclear physics experiment
  • The data from this experiment was stored in flat files, essentially a highly compressed binary format. The people involved in the experiment were not using SQL, but they were investigating using it for the next generation of experiments
  • There were two reasons for this. They struggled to manage data using the existing simple approach, which consisted of flat files and some directory structure. Also, the rest of the experimental physics world was starting to adopt these kind of technologies. They were still years (decades) behind the software and technology world, but they had recognized that managing large quantities of data becomes easier when you use technology to do it for you.
  • There were several problems with flat files. You usually end up with a large number of them, and need to index them somehow. A SQL database is ideal for this indexing purpose.
  • Metadata is difficult to manage with flat files, but trivial to manage by creating ad-hoc database tables.
  • Given these considerations I would imagine SQL is now widely used across the scientific research world. Larger research collaborations were of course using it long before we were looking at it. We were decades behind CERN for example.

A few reflections from industry:

  • SQL is about the simplest kind of data storage technology which exists with the exception of files on disk.
  • If you work for any kind of technology or software company you will encounter them from day 1.
  • My first job after the research physics was working for a fintech. SQL was pretty much the first thing I learned while working there. It wasn’t hard to learn, it didn’t take long to learn (maybe a week to become pretty competent)
  • The current state is that SQL is actually considered pretty uninteresting. By which I mean it’s a general purpose tool. There are all kinds of niche data storage technologies available now, each one tuned and designed for a slightly different context and application. This is necessary to get many times the performance for certain types of data operations than something like a relational database can provide. This isn’t even the frontier. These niche technology products have been around for quite some time now.

I hope these reflections offer some useful and interesting insight.

Just my opinion: To say scientists avoid SQL is a bit like saying scientists only use Fortran.

There was a time when such statements were generally true, but to say such a thing in 2024 (or even 2018, probably) is a bit like saying in 2010 I only know Fortran and haven’t learned either C++ or Python yet.

I was a scientist in 2018 and I was not using SQL, but to say that is really quite embarrassing as it’s such a simple and ubiquitous technology that everyone with any technical knowledge of programming should understand how to use it.

1 Like