Highly concerned with the state of database libraries


#1

TLDR;

My experience with the DB access libraries has been overwhelmingly negative - I’m concerned that the Julia ecosystem is missing a huge foundational piece for success. Things don’t look any better as we’re moving towards v1.0, the respective repos being almost deserted in terms of activity. These libraries are too important to be left at the mercy of open source contributors and we need a solution fast. A v1.0 language with v0.1 DB access libraries is of no use for a large array of projects.

Rant

With the risk of sounding like a broken record – it was only a year ago when I was raising the alarm about Julia losing support for PostgreSQL as its only library, at that point, became abandonware – I’m afraid I have to say it again: I consistently run into critical issues with Julia’s DB libraries. The packages are undeveloped, insufficiently tested and poorly supported.

For my perspective (my focus being web development and general computing with Julia) this is got to be one of the worse possible problems.

To put things in context, I’m using Julia and Genie to develop a decently ambitious web app: a reasonably complex web app for handling hotel bookings and reservations. It’s really nothing crazy but it’s not a toy/theoretical project either: we’re talking about some 300K+ hotels in a DB table plus related data.

I’m also the developer of SearchLight, a Julia ORM - so I’ve had a lot of exposure in the last few years to the lower level DB libraries (MySQL, SQLite and PostgreSQL). Thanks to SearchLight I can easily swap DB backends - and as it’s customary in web development, I started with a SQLite DB in development. SQLite seemed to work, until I started importing the 300K hotels data from JSON files. After a few thousand DB operations during the import, Julia would segfault. Issue reported here, no response yet despite the severity of the problem: https://github.com/JuliaDatabases/SQLite.jl/issues/146

OK, that was bad. But luckily, I could switch to MySQL. That worked OK in v0.6 but it turns out unusable in v0.7. OK, sure, “it’s pre v1.0, what can you expect”, I know the story - but bear with me, I’ll address this. Here again, it usually works, but when used with production requirements, you end up with horrible problems. Like the impossibility to run a simple select query. Look at this: 45s, 200 M allocations and 12 GiB! For a query which takes 3.1ms in the MySQL client.

julia> @time MySQL.query(conn, "SELECT `hotels`.`id` AS `hotels_id`, `hotels`.`facilities` AS `hotels_facilities`, `hotels`.`themes` AS `hotels_themes`, `hotels`.`nr_rooms` AS `hotels_nr_rooms`, `hotels`.`country` AS `hotels_country`, `hotels`.`destination` AS `hotels_destination`, `hotels`.`nr_bars` AS `hotels_nr_bars`, `hotels`.`hotel_score` AS `hotels_hotel_score`, `hotels`.`nr_restaurants` AS `hotels_nr_restaurants`, `hotels`.`availability_score` AS `hotels_availability_score`, `hotels`.`checkin_to` AS `hotels_checkin_to`, `hotels`.`checkout_from` AS `hotels_checkout_from`, `hotels`.`address` AS `hotels_address`, `hotels`.`zipcode` AS `hotels_zipcode`, `hotels`.`currencycode` AS `hotels_currencycode`, `hotels`.`regions` AS `hotels_regions`, `hotels`.`checkin_from` AS `hotels_checkin_from`, `hotels`.`phone` AS `hotels_phone`, `hotels`.`descriptions` AS `hotels_descriptions`, `hotels`.`longitude` AS `hotels_longitude`, `hotels`.`name` AS `hotels_name`, `hotels`.`email` AS `hotels_email`, `hotels`.`latitude` AS `hotels_latitude` FROM `hotels` WHERE (`hotels`.`id` = 218) ORDER BY hotels.id ASC LIMIT 1")
 44.534217 seconds (202.54 M allocations: 12.060 GiB, 0.75% gc time)

Here is the full issue: https://github.com/JuliaDatabases/MySQL.jl/issues/113

In v0.6 as well, there was a problem with a finalizer error (and it still is in v0.7). Again, not an issue when “playing” with MySQL, but when used to perform tens of thousands of operations the output of the error would clog the REPL and slow it down till it had to be killed. Not ideal for web apps which are expected to run uninterruptedly for months at a time.

Why is this a very bad thing?

  1. First of all, because it all seems to work. But in fact, these libraries have not been properly tested and these problems are only discovered late in the development process when it’s extremely costly to switch to a different tech stack (honestly, you’d be forced to drop Julia altogether because one just can’t deliver a product in these conditions).

  2. Then, when things break, you’re pretty much on your own. You wanted to show off Julia – well tough luck, now you have to explain how come that basic features don’t work. Luckily, this is a project I’m doing with friends in our spare time, so no huge loss. Sadly, it’s impossible in these conditions to recommend Julia at work. It’s cool that we have DataFrames and Query and DataStreams but if these run on such a shaky foundation, they’ll crumble.

  3. “Yes, but it’s Open Source and it’s pre 1.0 - it’s normal”. Yes, but these are infrastructure critical projects (I said it before and I will repeat it everytime I can). It’s right there with File IO. Look at Elixir, to take an example of a highly successful new language (to stay in the realm of web development, which I know well): the core team released the language, the web framework (Phoenix) and the ORM (Ecto). They are all under the core team’s umbrella.

And what’s worse, it doesn’t look like things are on the right track either – as we’re getting close to v1.

Look at the state of the contributions - it doesn’t look good for v0.7 / 1.0.

Don’t get me wrong, the contributors are Open Source heroes and I’m grateful for all their work. It’s not a statement about them. They’re doing amazing work but it is what it is - they need help. I suggested in the past that Julia Computing uses their organizational and funding clout and experience to properly manage these DB “infrastructure” libraries. I still think that’s the best approach, but maybe other people have different and better ideas.

If I’m wrong or too pessimistic, please correct me - I could really do with a more optimistic version of events. Because the way things are today, basically I can’t use Julia to build pretty much anything. And that is frustrating as hell, especially as things are worse than a year ago when I was raising the same problems but was talking about Postgres. Yes, Julia is an amazing language and there are amazing features and libraries which work very well, but for web development we need proper DB access libraries.


Rust vs Julia
#2

I sympathize with your predicament, but perhaps this is an exaggeration? Most scientists in fact purposefully avoid having anything to do with SQL.


#3

Ha so true. Having worked in both fields I can relate to needing postgres to just work, but also not caring at all what postgres even is.


#4

Sure, that’s why I mentioned web development all the time. I have no doubt that other users of the language have different needs – although a lack of proper DB access libraries would be a considerable negative for the adoption of any language.

As for web development, usually, regardless of what one does, they’ll likely need some form of web publishing. Either as APIs, interactive dashboards, web apps, etc. And then one will most likely need a DB, for things ranging from basic features to authentication, user preferences, etc. To be forced to resort to a different tech stack just because your language doesn’t support SQL DB access, well, that’s quite limiting and unproductive.


#5

I totally agree that DB support is not where it needs to be. In my current work being a CS person on a team with a bunch of EEs I am spreading the good word of SQL. Scientists and engineers spend a lot of time doing things that are easy in SQL (see pandas join/merge/query functions). Lack of good DB libraries in Julia forced a change of language on this project.

If MATLAB came with a good DB interface without a separate license fee, a lot more scientists would know and use SQL.

Web Development is entirely Database driven because the needs of ACID are so important in multiuser systems. We really can’t be a general purpose programming language if we can talk to relational algebra systems.

If relational algebra got the same effort and creativity as linear algebra we would be a leader in that field too.


#6

It’s great to voice your opinion and this is the appropriate place to do so. I agree that database support has always been in a state of probably “just working”; and as you noted, sometimes not.

This is still open source though, so it still boils down to people having enough time/motivation/resourced to work on the things that matter to them. In my experience that has usually meant diving in myself and doing the work that I’m going to heavily rely on (even if that involves heavy work in julia source code).

For me personally, I tend to do “rounds” of work on various packages and cycle through open issues and outstanding needs. I’ll take a package, go thru open issues, try to knock out as many as I can, implement a few features, add more tests/docs for a week or two and then move on to another package.

As a primary developer on the packages you mentioned, I’ll try to do better in responding and working on issues. Usually a “ping” on an issue helps when something may have slipped through initial email triaging. Providing as much detail on the system, exact setup/files/tables to reproduce issues, all that goes a long way to making bug fixing as easy as possible for those willing to dive in underneath.


#7

@quinnj I am aware of your work – I’m using many packages where you’re a top contributor. Including HTTP.jl – a very important package in the web development ecosystem. These observations were not about you (or any contributor). It’s not about guilting people into putting more effort into open source development. On the contrary, we need to support each other and be thankful. Your contributions are meaningful and valuable. It’s just that, like you said, it’s a lot of ground to cover.

It’s about raising an alarm and about looking for better solutions in a public forum of smart, like-minded people. The idea that anybody can contribute to just about anything is not always true. And in this case, judging by the results, it doesn’t seem to have gotten the desired output. Some people are good at and like to work at this level of the stack. Other people are better and more productive and can provide more value working on other Julia things. As it turns out, not many contributors showed up for these packages.

So maybe it’s time to look for other solutions. Maybe we can crowdfund the development of these packages? But even if we raise say $5K - $10K, who’s going to do the development? Who and how will manage the budgeting of the projects? Would you be able to take more (paid) work?

Or maybe one of the companies represented here can organize an internship to develop these. Or maybe they’d like to sponsor the effort with some tax-deductible money.

Also, if we find a good approach, this could be something we could repeat in order to support other packages of “public interest”.


#8

I share the sentiment, but I don’t see action points. Should we just spend a weekend on a single package and significantly improve it? I’ll have a couple of hours of spare time during the next 2 days, if you post 3-5 important tasks that don’t require too much previous experience in a package, I’m in.

A couple of tasks to consider (thinking of LibPQ.jl as an example):

  1. Make it work on Julia 0.7.
  2. Improve documentation, add more examples.
  3. Close some of easy, but time-consuming issues (e.g. 1, 2, 3).
  4. Test the library on your system, with your DB version to make sure new users don’t hit the wall.

Although in general I work with databases quite a lot, I don’t use them much in Julia, so it’s hard for me to prioritize, so suggestions are welcome.


#9

At times I wanted to contribute to Julia but when you don’t know what’s going on it’s not so easy, you can look up issues but it’s not always clear what’s to be done or if someone is already working on it. For that reason I feel like having someone preparing tasks, with a short but clear description would help. It needs to be someone that has a good overview of the issues in the domain, and the task need to be substantial enough that delegating them isn’t taking more time that doing them directly. For larger projects crowdfunding could work, but someone as to take the initiative and lead the way.


#10

The startup that I’d been consulting at for almost 3 years ended up moving off of Julia, in large part because of this issue and other similar ones (with string handling).
We’d been doing all of our backend in Julia for 2.5 years, the frontend is JS (TypeScript really) & Node.js.

It’s a very serious issue. Jacob has been my hero for over 3 years, he’s done more than anybody else AFAIK to move Julia forward in the areas that are important to me (see the end of my README for Strs.jl), but no matter how talented he is, we don’t yet have a cloning device, and he’s only one guy (yes, many other people contribute, but he’s the most prolific IMO)

Having spent 29 years with a language where the database was simply part of the language itself, I find it hard to get anything real done without a database of some sort (not necessarily SQL) available, for any sort of project, not just web development.


#11

I didn’t know about that! (pretty disturbing :P) Don’t they use things like dataframes apis in other languages to populate their frames from database tables?

In my case all data I’ve handled over the years, in any language I’ve used, for almost any task I’ve done; ultimately comes or goes to a production database.

<rant>
I have also stopped recommending Julia at work, because:

  • It’s ecosystem is not ready for general purpose programming, ie database support (understandable).
  • The core developers don’t respect self imposed deadlines (feature freezes) that we all depend on.
    • Julia 0.7/1.0 will be probably late for 1+ year.
    • At the time of supposed feature freeze, core developers were blunt about no more unimportant PRs and wasting their shared CI resources with non breaking changes, that would go for post 1.0. (to this day the feature freeze hasn’t reached freezing point and it’s becoming the norm in Julia’s development cycle)
    • They didn’t listen to everyone telling them it was unrealistic, and change schedules accordingly in an open way to let know the community what to expect (to this day we still joke about when it will be ready).
  • This in turn has halted development on much needed production ready libraries, debugger, even DataFrames/tabular data packages and Plotting (which I assume scientists do use a lot) were/are in a mess (haven’t checked up again).

Perhaps I will be able to recommend Julia 10 at work, when it comes out.
</rant>


I agree that this important libraries should have gotten funding too, it could still happen, I guess.

  • How can we see how the funding is being invested?
  • Are these funds for Julia Computing the company, or for Julia the open source project?
  • Is this data intended to be public or secret?
  • Who handles the monies? Moore foundation or Julia Computing?
  • How could we vote or propose a set of curated libraries that could be included in this funding scheme in case it is possible?

I will feel more assured if one day it is created a non profit “Julia Foundation”, independent from Julia Computing. (we are still relatively in the early days/years)

For me robust data base access is also a must, either if used directly or indirectly, from high or low level APIs, I second Essenciary that it should be almost on the same level as file i/o, but I am not a scientist.

If the sentence: “We want something as usable for general programming as Python”, was not mentioned in “Why we created Julia”, I wouldn’t have ever given Julia a try.

I understand focus is important, but I wonder what does the people in charge of the funding money think about promoting this kind of packages too.


#12

If Julia Computing or any organization / group of individuals with the skills to improve in this area put together a proposal, I think it is a great idea to crowdfund this. I’d be willing to contribute financially to a well-defined project proposal along these lines. I really want to use Julia for general purpose web applications. I can see the potential and it is beautiful, but there is still some basic work to do that I believe a few dedicated devs working over a relatively short time scale could crank out.


#13

Mostly, if you have a relatively small amount of data, reading from CSV to something like a DataFrame is a viable option. If you have lots of data, may you end up with something custom.

I am not saying that SQL does not have important benefits, especially in a business environment. But the setup & maintenance cost can be large, including learning something new.


#14

I suppose databases are crucial for many “production environments” where many users might read-write on a disc. Nevertheless I agree that most scientists can work without any databases. The biggest dataset I have used from kaggle was simply a set of folders withs lots of images containing around 200 GB of data. Most people I know work with far less data (specially once it’s clean).

I feel though that essenciary has risen an important topic to cover. Julia for science might not need a database ecosystem but if the language is supposed to be used as a “general purpose” programming language then it has to please and facilitate the work of many types of users.


#15

I think it’s reasonable to hold off on recommending Julia to many potential users right now, but don’t think this should be any kind of knock on the core development team – building the base language is a ton of work, and what they have undertaken is pretty ambitious. The folks who have been building higher level libraries have been working with a moving target, so they have their work cut out for them too. Getting to a mature ecosystem is still going to take a lot more time and a lot more work by a whole lot of people.


#16

I got reasonably good results using PyCall and cx_Oracle to access an Oracle database. Presumably the same strategy can be used for other types of databases. Alternatively, JDBC.jl would allow the broadest access to any kind of database.

Having said that, I do hope to see a more mature Julia database ecosystem. Having to rely on python or java binding makes things slow and complicates the installation process.


#17

I think that maybe you shouldn’t equate SQL with database. There are a lot of other fish in the sea!

For example, the database in Caché is totally integrated with the language (it’s a hierarchical database, you can use it as a simple key/value store, use it for object storage (without any ORM getting in the way), and also use SQL (which is mapped to the low-level distributed hierarchical database via fairly fast generated code).
The setup/maintenance cost for that is fairly minor, and learning to use it is as simple as learning to stick a “^” character in front of a variable reference to make it persistent/shared as opposed to non-persistent/process-private.

That sort of integration would not be possible with most languages, however with Julia, it was really not hard to do at all (you just have to love multiple dispatch, the type system, and the ease/speed of wrapping large C libraries :grinning:)


#18

Julia does have an independent foundation—it is called NumFocus. Donations to Julia at NumFocus only go to open source development and maintaining Julia project infrastructure. Julia Computing has never and will never take any money from open source donations to the Julia project. There may be a time in the future for a separate Julia Foundation, but creating a 501(3)c non-profit in the US is a lot of work and takes many years. Starting a foundation doesn’t at the moment seem like a good use of time and energy—NumFocus is a very good home for the project.

Perhaps there is some misunderstanding of how funding works. Who are “the people in charge of the funding money”? There is no big bag of money that some cabal of core Julia people decide to spend on whatever we want. People—and anyone can do this—apply for grants, and if one is granted, then you can use that money to fund the work you applied for the grant for. You can apply for grants from foundations like Moore and Sloan—they fund a lot of open source development that advances society through science. You can also apply for grants from government granting agencies; the US government maintains a list of granting agencies and supports a lot of research in science and technology. To get a grant, you have to submit an application to the organization in question and convince them that they should fund some work and that you can get the work you’re proposing done. You can write a proposal for whatever you want, but they also have to think it’s worth funding. After receiving a grant, you must use the money for what you proposed, you cannot use it for whatever you want unless the latitude to do so is built into the grant.

You could apply for a grant to improve Julia’s database connectivity support. If some grant organization was convinced to fund this, they would not give the money to you personally—you need some reputable organization to accept the money and administer it. NumFocus can play this role for open source Julia grants, so that would be a solid choice if you got such a grant. As a bonus, the grant money would be tax exempt since NumFocus is a 501(3)c, as long as you meet the requirement that the work fits NumFocus’s mission to support scientific open source software.

Aside on overhead: NumFocus takes 10% overhead out of grants and donations, which is how NumFocus pays for salaries and infrastructure. This is very low overhead; by comparison, overhead at universities is typically 55-60%—i.e. professors apply for research grants, and if they’re granted, the university takes ≥ 55% off the top and the remaining ≤ 45% is used to do the work. There is a common misapprehension that universities pay people to do research, which is not how it works: professors apply for research grants from granting agencies and universities take over half of that grant money, leaving the rest to be used for actual research. Effectively, universities are paid to allow research to be done at them; in turn they provide buildings, facilities, infrastructure, administration, and most of all legitimacy.

If you were to get a grant to work on database connectivity support in Julia and chose to receive the grant money through NumFocus, then the Julia project signatories with NumFocus would make sure that the money is used appropriately. I.e. that it is actually used for open source Julia development as intended, and not, say, buying a yacht or developing proprietary software to be sold for profit. Such uses would not only be fraud, but could cause NumFocus to lose its 501(3)c status since they are not eligible for tax exempt status. The NumFocus signatories for the Julia project who are responsible for oversight of the use of Julia funds are:

  • Jeff Bezanson
  • Tim Holy
  • Steve Johnson
  • Viral Shah
  • John Myles White

This set of people was chosen to be trustworthy, independent (not all from any one company or organization), and representative of the open source project. It could probably stand to be updated since John is no longer very actively involved and the community is much larger now than when Julia joined NumFocus. However it doesn’t matter much because the signatories do NOT get to decide what any money is used for—the funding agency and the grant applicant are the ones who work that out. All the signatories do is make sure that funding is being used for its intended purpose.

You may have noticed that Julia Computing does not enter the picture in any of this. Perhaps the confusion comes from the fact that Julia Computing has applied for and gotten some grants from the Moore and/or Sloan foundations to do open source Julia work. Those grants did not go through NumFocus because:

  • Julia Computing was the entity that applied for the grants;
  • all people involved in doing the work were Julia Computing employees;
  • Julia Computing is itself a grant-receiving organization which the foundations consider trustworthy, so passing the money through another organization (foundation or university) would only incur additional overhead for no benefit.

Those grants were given and used for specific purposes, including the development of Julia 1.0 and Pkg3. Julia Computing is directly responsible to the Moore and Sloan foundations for the use of those funds and provides regular reports on progress and results. Julia Computing cannot repurpose the money for other projects—it must be used to do the work for which it was granted.

It is possible that some granting agency would be interested in funding improved database connectivity support in Julia, but I kind of doubt it. That sort of work is definitely not research, so the government would not be interested in funding it. It is also squarely in the domain of for-profit companies, so foundations like Moore and Sloan are unlikely to fund it, unless the case can be made that it is crucial for scientific research and fills a gap that cannot otherwise be filled. That’s an argument that’s hard to make because companies are interested in databases and connecting to them, so the most likely source of funding is a company that wants Julia’s database connectivity to be better. It may well become a priority for Julia Computing in the future. At that point, however, we would not be using grant money to work on it, we would be using the company’s money because we believe it is good for business.


#19

Try a little enlarge point of view! :wink:

Julia could be useful for people who knows SQL, in environment where DB is already set and already part of ton of workflows, where transition to another DB or other solution is possible only step by step and would take several years.

So zero additional cost to installation or maintenance DB, but opportunity for Julia to show its power in data-transform/analyze area.


#20

One small correction, in case anyone is worried after hearing this. Generally for federal grants, the overhead is added on top of the “direct costs” (the budget) that the investigator applies for. So the overhead is more in the range of 1/3 of the total. Certain categories of expenses, such as high-end equipment, are also excluded from the overhead calculation.