Combining MySQL and Query

Is there a way to execute a Query.jl style Query directly on a MySQL database? Given that MySQL.jl gives an iterable tuple wiht MySQLRowIterator(con, command) I thought something like this could work:

@from i in MySQLRowIterator(con, command) begin
       @select i[1]
       @collect
end

However there are two issues:

  • It doesn’t infer type: I’m getting an Array{Any, 1} as a result
  • MySQLRowIterator actually iterates tuples rather than named tuples which makes it a bit harder to have it interact with Query.jl

Is there a way to get an iterator of named tuples from a MySQL.jl query to be combined with a Query.jl query in a type stable way?

One advantage would be to use the (in my opinion more polished/powerful) Query syntax directly on the remote database, without having to learn SQL.

EDIT: should have looked harder before posting here, I’ve just noticed https://github.com/davidanthoff/IterableTables.jl/issues/55

There are two levels of integration that we could aim for: 1) at the TableTraits.jl level and 2) at the Query.jl level.

The 1) option should be pretty straightforward. Essentially we need to solve the two issues you identified, but that is pretty simple. I mainly just wanted to finish some under the hood refactoring with the iterable tables stuff, which is done now, so I might take a crack at this soon.

But 1) is actually pretty inefficient. For example, if you select a column via Query.jl, you will still transfer all the data from the database into main memory, just to discard it right away. It would be much more efficient to select that column in the SQL query itself, so that the DB engine never even loads any of the columns that you don’t need. This points to 2): Query.jl is actually designed in such a way that you can write a Query.jl query, and if the source is say a MySQL database, it will translate that Query.jl query into a SQL statement and then execute the whole query inside the database. There is one caveat: someone needs to write a Query.jl backend for MySQL that actually does this translation step, that part doesn’t exist yet. I have a very rudimentary backend for SQLite that can translate one very narrow type of Query.jl into SQL, more as a proof of concept that I got the architecture right. In generally, writing a backend is neither a trivial nor a small project and I probably won’t have time to pursue that in the near term. I might try to find some folks to work on this going forward, either here on campus or this would also make a nice Google Summer of Code project for next year.

2 Likes

Thanks a lot for the thorough explanation! There’s a plan to make StatPlots.jl work with any table-like structure (not just DataFrames, as it is now), ideally only depending on IterableTables (not on Query) and I was thinking that plotting directly from a MySQL source could be common enough and it’d be nice to support it. It only needs to support extracting the columns that are relevant for the plot into some arrays, which in my mind would be basically (let’s say we want to plot xcolumn against ycolumn):

iter = getiterator(df) # df could also be SQL source
type_info = Dict(zip(column_names(iter), column_types(iter)))
x = Array{type_info[:xcolumn]}(0)
y = Array{type_info[:ycolumn]}(0)
for i in getiterator(df)
    push!(x, i.xcolumn)
    push!(y, i.ycolumn)
end

If I understand correctly, this is possible (and potentially soon to be implemented also for MySQL), and reasonably efficient for a DataFrame (as all the data is already in memory) but not so much for SQL as we load a lot of unnecessary stuff. Making it more efficient requires:

  • using Query
  • quite a bit of work

What I’m curious about is, could this example application be optimized directly in IterableTables? That is to say, would it make sense to also add an extra method:

getiterator(df, cols)

which would only iterate named tuples with the columns selected in cols (while not even downloading the remaining columns / reading the remaining columns it it’s from a CSV, etc), or do you believe that this already counts as a query and should be part of the larger Query to SQL statement translation project?

Yes, integrating iterable tables with StatPlots.jl would be great. I had a go at that a while ago, but in the end stopped because it was not super trivial and I felt I needed to change one thing with the packaging of the iterable tables interface. That is actually done now, so would be cool if we could revisit that jointly. Let me know where a good place is to discuss that and I’ll join and help.

Maybe I overstated the inefficiency: unless you deal with a lot of data, I’m sure folks won’t even notice this issue, in general iterable tables is pretty efficient. So I almost feel we should just start with the iterable tables integration for MySQL.jl, that would be a huge boost in things folks can do, and anything else could be done later.

I’ve been thinking about your idea from the last paragraph a fair bit lately. There are other situations where something similar applies: for example, when I load a CSV file with CSVFiles.jl and then have a query that only uses a subset of columns, it would be ideal if the CSV parser actually already skips the columns that aren’t needed. My gut feeling is that this would be best achieved by having CSVFiles.jl be a custom Query.jl backend that can handle a very small subset of queries itself, and then would hand any other, more complex query off to the default iterator based implementation. But I’m still mulling over this scenario… In general, I’d like to keep the iterable tables interface absolutely minimal, simple and stable (ideally I won’t change anything about it for a long time), I think that is really key for making it succeed widely.

Your opinion/help is extremely welcome! The plan is to support IterableTables, via a macro that would take care of replacing symbols in the plot call with the respective column from the table. On the StatPlots README there are some examples of how to use the macro and this is the PR with IterableTables support:

https://github.com/JuliaPlots/StatPlots.jl/pull/82

Interestingly, it was noted in the PR that adding support to JuliaDB has been an open issue for about two years now (with an optimistic easy - up for grabs label…).

In general I agree that it’s probably best to go for the low hanging fruit of adding MySQL support via IterableTables. The plotting macro only needs to do one pass over the data where it selects some columns, the time to do this is probably neglectable compared to actually drawing the plot.