Might the community find a curated table of database adapters and their current state helpful? For example, LibPQ.jl is in pretty good shape, but other PostgreSQL adapters aren’t. This would probably be very helpful for newcomers to know about. I’m not in a position to construct this table myself. Perhaps it could be a lovely update to the top post of this thread if there’s no other appropriate place to put it? Just a thought.
Would like to just chime in here that in bioinformatics, in social sciences like econ or demographics research, in GIS, and in a variety of other areas of science, access to big databases of public data is fairly crucial. I worked on a project in R where subsampling the Census American Community Survey (ACS) was needed, and I have friends doing bioinformatics where access to big datasets of organized genomics data is critical.
In general, the pattern of querying a big data store to get a large table of data, doing analysis, maybe writing back a big table, and then merging tables together to do the next step is far more prevalent in science than the patterns common in web dev, such as checking for auth tokens or grabbing cached state hundreds or thousands of times a minute.
Julia libs should be capable of reading 2 million rows rapidly and memory-efficiently, so that someone can do a massive survey analysis, or reconstruct the whole sequence of the 4th mouse chromosome with gene annotations, for example.
I don’t think there is any disagreement that Julia would benefit from mature libraries, nor is there a shortage of feature requests.
The key question here is whether there are willing contributors to realize these goals.
If someone is looking for key-value storage: there is LevelDB.jl.
Well, I read the whole thread just this morning, and was struck by:
My main point was to mention that there are different patterns of data access and storage between people doing analysis of large datasets and, say, web development. What works well for web development and lots of small queries might easily fall over if you ask it to join 13 tables together and hand you 3.5 million rows so you can analyze county-by-county poverty measures, or do analysis of 3.5 billion base pairs.
There wasn’t much here in the thread by people who access SQL databases for science purposes so I thought to mention it.
If someone finds SQL useful, they should go for it.
Frankly, I am always puzzled when someone argues that some feature or library is “critical” for someone. This is unlikely to make anyone else implement it, if they don’t otherwise need it themselves.
My point was that this is not a choice that the “community” needs to be convinced about, because the bottleneck is contributions, which are most likely to happen by people who find these features useful in the first place.
Note, I’m not complaining, or arguing that anything is critical, just pointing out the kinds of technical demands on how SQL is used daily in certain areas of science.
My experience with SQLite.jl has been that it’s perfectly adequate for data science problems. I was using it for exactly the problem you describe (querying the ACS), and it performed fairly well.
One problem that I’ve noticed (but haven’t had time to look into improving yet) is that there seems to be a bottleneck in importing large data into memory from a SQLite table. It’s manageable, but I prefer to use SQLite mostly for operations where I do most of my calculations/aggregations in SQL and only then bring relatively little data back into Julia. What I’ve been doing more recently for working with large data is to just stream it to a Feather file, and then manipulate a dataframe of Arrow-backed arrays. It works beautifully, and the deserialization is way faster than reading from SQLite tables.
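Roughly, the pattern I mean is something like this (a sketch with made-up file/table/column names, assuming SQLite.jl’s DBInterface-style query API and Feather.jl):

```julia
using SQLite, DBInterface, DataFrames, Feather

db = SQLite.DB("acs.sqlite")

# do the heavy aggregation in SQL, bring back only a small summary
summary = DataFrame(DBInterface.execute(db,
    "SELECT county, AVG(income) AS avg_income FROM acs GROUP BY county"))

# for big extracts: dump once to Feather, then reload as Arrow-backed columns
big = DataFrame(DBInterface.execute(db, "SELECT * FROM acs"))
Feather.write("acs_big.feather", big)
df = Feather.read("acs_big.feather")  # deserializes much faster than re-querying SQLite
```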
This is good to hear. How does MySQL/Mariadb work in this kind of application? Have you had a chance to try?
No, I haven’t. MySQL has always seemed like too much overhead to figure out compared to SQLite, so I’ve mostly just stuck with SQLite. If people think MySQL works better for this kind of stuff, maybe I’ll have to reconsider. (I’m watching this thread with interest.)
Well, for example, I got the entire microdata of the ACS, and it’s a gazillion bytes of CSV, one file per table per year… I wanted one table across the whole year sequence… MariaDB’s CONNECT tables allowed me to query the CSVs directly to import into my one big table… it easily saved me a week of munging. (For those who aren’t familiar: each yearly ACS release asks potentially different questions, and therefore has a different number and order of columns, and comes with a PDF that describes the data… munging this by hand is a nightmare; munging it in SQL is a lot nicer. If you can take the CSV and just pretend it’s a data table, which CONNECT lets you do, it’s maximally easy.)
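A sketch of what that looks like (paths, columns, and connection details are made up; CONNECT is a MariaDB storage engine, and this assumes MySQL.jl’s DBInterface-style API):

```julia
using MySQL, DBInterface

# connect to a MariaDB server that has the CONNECT engine installed
conn = DBInterface.connect(MySQL.Connection, "localhost", "me", "secret"; db = "acs")

# expose a raw CSV file as if it were a table (columns discovered from the header)
DBInterface.execute(conn, """
    CREATE TABLE acs_2016_raw
    ENGINE=CONNECT TABLE_TYPE=CSV
    FILE_NAME='/data/acs/acs_2016.csv' HEADER=1;
""")

# then pull just the columns you care about into the one big unified table
DBInterface.execute(conn, """
    INSERT INTO acs_all (year, serialno, income)
    SELECT 2016, SERIALNO, HINCP FROM acs_2016_raw;
""")
```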
Is MySQL “better” unconditionally? No, but it definitely can do some different stuff that can be useful. One thing it does well is let more than one person access the same big data store; SQLite, not so much.
I pull hundreds of millions of rows from MS SQL Server using ODBC.jl. It works but is definitely something I’d like to optimize better.
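A sketch of the streaming pattern (DSN, table, and column names are placeholders; this assumes the DBInterface-style API of recent ODBC.jl, where streaming the cursor row by row avoids materializing everything at once):

```julia
using ODBC, DBInterface, Tables

function sum_amounts(conn)
    total = 0.0
    # iterate the cursor row by row instead of loading all rows into memory
    for row in Tables.rows(DBInterface.execute(conn, "SELECT id, amount FROM big_table"))
        total += row.amount
    end
    return total
end

conn = DBInterface.connect(ODBC.Connection, "DSN=mydsn;UID=me;PWD=secret")
println(sum_amounts(conn))
DBInterface.close!(conn)
```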
Our situation right now is a lot better than it looks on the surface, but certainly the fact that one needs to build Julia to use JDBC (and Spark, for that matter) is a significant blow. If you are using a specific database, you have some pretty good options:
- As far as I can tell, ODBC.jl works just fine and is actively maintained primarily by the industrious @quinnj, who valiantly does a lot of the work that, frankly, many of us just really don’t want to do. It is, however, in my opinion, a big pain to set up (because of ODBC itself, not because of ODBC.jl).
- SQLite.jl also seems to be actively maintained and working fine, though I don’t think I’ve ever used it.
- MySQL.jl also seems to be actively maintained and working fine.
- LibPQ.jl (for Postgres) is actively maintained and working fine (see the sketch just after this list).
- JDBC.jl is actively maintained and works fine, but only on Julia versions prior to 1.1 (because of the JavaCall issue). This renders the package dead to those of us who insist on upgrading Julia and for whom building Julia ourselves is impractical for whatever reason (mostly deployment, in my case).
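Since LibPQ.jl is the one in the best shape, here is a minimal usage sketch (connection string and table are made up; `execute` is LibPQ’s query function, and its results implement the Tables.jl interface):

```julia
using LibPQ, Tables

conn = LibPQ.Connection("host=localhost dbname=mydb user=me")

result = execute(conn, "SELECT county, poverty_rate FROM acs_summary")
tbl = columntable(result)  # NamedTuple of vectors, via the Tables.jl interface

close(conn)
```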
Something very important that we are lacking is a universal interface for these packages. It is therefore not trivial to swap one for another, which is a severe limitation in some cases. Loading data from these packages (as opposed to executing queries) does have a universal interface in the form of Tables.jl (we also have @quinnj to thank for that), so at least the hard part is already done. Julia has a significant advantage here: Tables.jl is much more well-thought-out than the analogous pieces of, for example, ODBC, JDBC, or SQLAlchemy, for which loading data seems to be something of an afterthought (though, to be fair, in SQLAlchemy’s case I suspect the package was designed mainly with ORM in mind). Unfortunately, because there is no uniform way to make queries, Tables.jl alone is not enough to render your database code generic, and one still has to make a conscious effort to wrap functions that talk to databases.
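To illustrate what Tables.jl already buys us: something like the following (a hypothetical helper, not from any package) works on any driver’s query result, because the results of LibPQ.jl, SQLite.jl, MySQL.jl, and ODBC.jl all implement the Tables.jl row interface. It’s only the connect-and-query step that differs per package:

```julia
using Tables

# works on any Tables.jl-compatible query result, regardless of which driver made it
function column_mean(result, col::Symbol)
    total, n = 0.0, 0
    for row in Tables.rows(result)
        total += Tables.getcolumn(row, col)
        n += 1
    end
    return total / n
end
```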
I myself made some bold claims in the past about my own efforts to develop a universal database interface. I wound up completely abandoning the idea because, firstly, at the time I was largely transitioning to using Postgres in most places at my job, so my need for this decreased significantly. Second, in Julia’s case such an interface would require separate PRs to lots of disparate packages. We are still nowhere near having a comprehensive SQL interface package like JDBC or SQLAlchemy (replicating those in full would be a huge amount of mind-numbing work that nobody should ever suffer through). The good news is that a minimal uniform query interface wouldn’t be a huge amount of work, since, as I’ve said, Tables.jl takes care of the hard part. The bad news is that someone would have to come along and actually do it, and coordinate with all existing packages for it to be any good.
For me, the remaining problem right now isn’t the big names like Postgres or MySQL, it’s the other miscellaneous ones like (especially) Presto, MS SQL, or even AWS Athena (which fortunately is accessible through AWSCore.jl). The larger looming problem is that Spark.jl is now locked behind the JavaCall issue, and we seem to be increasingly living in a world where, for better or worse, everything is going to be done with Spark (which might be better than some existing SQL setups, at least, but every time I read something about Spark my opinion of it decreases, so I’m not overly enthused about this new world).
My one other (or two other?) two cents here is that, indeed, in the Julia community most of us have scientific backgrounds, and attitudes toward SQL range from mild dislike in the IR to passionate loathing in the UV (I myself am somewhere in the X-ray region). Nobody wants to work on it, and many of us wish we could get the hell away from it in the first place. I have endless frustration with things that have no business being in tabular formats being mangled into tabular formats. It doesn’t help that some people around me treat non-tabular formats like some unholy abomination, even when properly storing their data tabularly involves something horrible like JSON columns.
This is all really interesting. A very naive question: any reason no one has mentioned JuliaDB?
JuliaDB isn’t a traditional relational database — it’s more of an analytical tool to help you do computation over large, structured datasets.
That said, these days a lot of people are just dumping data into some Parquet files somewhere and using Spark and Presto to interact with them as an alternative to a SQL database (I’m still totally unclear on how adding data to existing sets is supposed to work).
If someone wants to share a detailed example of how something like this might work with JuliaDB, I’m sure we’d all be highly interested. (I still don’t think we can write Parquet files from Julia.)
Please don’t speak for the scientific community. Sure, some loathe SQL, but lots of science people like databases. Dealing with tabular data outside a database is as painful as dealing with non-tabular data in a DB. Reducing the “status” of DBs with disparaging blanket remarks isn’t going to help anyone. Not trying to be argumentative here, just encouraging people to express individual rather than universal opinions.
In my original comment I was claiming to speak for the Julia community more than the scientific community, though of course your point is just as valid there: I shouldn’t try to speak for others, especially in internet forums where it’s hard to convey a not-overly-serious tone.
That said, the comment was in direct response to much of what has been the main discussion on this thread: we don’t have better database support because the Julia community isn’t overly interested in working on it. I don’t mean to minimize the excellent work of those who have taken on some part of this task, but at this point it’s plain to see where most of the interest in the Julia community lies. I’m not saying this just to be difficult; it seems like an important observation. For example, perhaps it makes more sense to mitigate the difficulties by using Python’s database packages from Julia than to hope for contributions that might not come for a long time in some cases.
I take issue with this. Tables can be expressed either as a dict of arrays or an array of dicts. Both of these are expressed perfectly well by all sorts of hierarchical formats; however, more general hierarchical formats tend to be expressed very poorly by tabular formats. In fact, lately I’ve been thinking that what I really need is more general `join` and `groupby` methods akin to what is found in Query.jl, perhaps with some convenient `filter` and sorting methods. At that point it would start to become very easy not to force anything to be in specifically tabular formats.
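To make the two representations concrete (plain Julia, made-up data; incidentally, both of these already satisfy the Tables.jl interface):

```julia
# "dict of arrays": column-oriented, here a NamedTuple of vectors
cols = (county = ["Alameda", "Marin"], poverty_rate = [0.11, 0.07])

# "array of dicts": row-oriented, here a vector of NamedTuples
rows = [(county = "Alameda", poverty_rate = 0.11),
        (county = "Marin",   poverty_rate = 0.07)]
```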
Sure, that’s good to point out. I’d just add that, so far, it seems to me like a lot of the pain comes from trying to develop connections to existing software in a language that is rapidly evolving in its implementation. Maybe that’s a little less of an issue now that we’ve got 1.1.x? Not sure. I’m pretty new to Julia; I’ve been watching it from the sidelines for about 3 years, but I’m ready to start using it for things. Having access to databases would make that easier for me, so I’m interested in the state of DB access. I sometimes work with people who have big databases, in bioinformatics, econ, or whatever. It sounds like a few others are lurking around with similar interests.
I’d be willing to write some SQL benchmark tests that create, select, munge, subselect, etc. in a way that’s “typical” of data-science access patterns, if anyone who works on DB stuff would find those kinds of test cases useful.
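For example, something along these lines (a minimal sketch against an in-memory SQLite database; table and column names are made up, and I’m assuming SQLite.jl’s `SQLite.load!` plus the DBInterface-style query API):

```julia
using SQLite, DBInterface, DataFrames, BenchmarkTools

db = SQLite.DB()  # in-memory database

# load a mock "big" table from a DataFrame
df = DataFrame(id = 1:2_000_000, x = rand(2_000_000))
SQLite.load!(df, db, "big")

# a typical data-science access pattern: pull a large filtered slice back into Julia
@btime DataFrame(DBInterface.execute($db, "SELECT * FROM big WHERE x > 0.5"))
```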
Just my experience:
Unfortunately, I was not able to use ODBC.jl in production. It had some problems with my setup. I could fix one, but the other was, well, way above my pay grade.
However, Julia being Julia, I managed to very easily use pyodbc instead and do what I needed to do.
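In case it helps anyone in the same spot, the workaround looks roughly like this (connection string, table, and query are placeholders; pyodbc is driven through PyCall):

```julia
using PyCall

pyodbc = pyimport("pyodbc")

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;UID=me;PWD=secret")
cursor = conn.cursor()
cursor.execute("SELECT TOP 10 * FROM mytable")
for row in cursor.fetchall()  # row objects are converted by PyCall
    println(row)
end
conn.close()
```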