[ANN] SQLdf - SQL for Julia DataFrames

Hi guys,

If you like me happen to be accustomed to use SQL to explore and prepare tabular data then SQLdf might be of your interest.

Since we don’t have as yet a native Julia package to SQL Julia DataFrames I would typically use the excellent RCall package to use the R/sqldf library, however for certain queries there were one too many steps I had to go through and I thought I would just package them away in SQLdf.

A quick example follows:

import Pkg
Pkg.add("SQLdf")

using SQLdf
T = DataFrame(a=1:14, b=14:-1:1, c = split("Julia is great",""))
S = DataFrame(a=1:14, c=split("Julia is fast!",""))

sqldf("""
      select *  
      from T join S on T.b = S.a
      order by T.a
      """)

14×5 DataFrame
 Row │ a      b      c       a_1    c_1    
     │ Int64  Int64  String  Int64  String 
─────┼─────────────────────────────────────
   1 │     1     14  J          14  !
   2 │     2     13  u          13  t
   3 │     3     12  l          12  s
   4 │     4     11  i          11  a
   5 │     5     10  a          10  f
   6 │     6      9              9
   7 │     7      8  i           8  s
   8 │     8      7  s           7  i
   9 │     9      6              6
  10 │    10      5  g           5  a
  11 │    11      4  r           4  i
  12 │    12      3  e           3  l
  13 │    13      2  a           2  u
  14 │    14      1  t           1  J

Enjoy!

20 Likes

are you aware of Octo.jl? Some tie up of Octo.jl and SQLdf.jl would be nice. Even just for the syntax highlight.

BTW Is VScode capable of applying a different syntax highlighter for the code block sqldf"""

Also not turning up in google. So a link to the repo would be nice.

1 Like

Hi @xiaodai

It seems Octo’s goals are quite different, I created SQLdf just for quick interactions with Julia DataFrames, however it is also true that R/sqldf interacts with SQLite, H2, MySQL and PostgreSQL so you only need to set up your R environment to use those with SQLdf.

However, it’s better to use native Julia code when possible, that’s why I am not planning to extend SQLdf functionality beyond Julia DataFrames, and when a native Julia package allows me to interact with Julia DataFrames directly I will discontinue SQLdf.

Emacs person here, I don’t know about VScode but you can use multiple modes in Emacs though :slight_smile:

image

1 Like

Hopefully it works on Tables.jl tables, not only the DataFrames.jl package?

1 Like

I guess that if the implemenation on Tables.jl has column names it should work, have a go and let us know how it goes.

Tables.jl doesn’t have a single implementation, it’s a common interface for all of Julia’s many tabular structures. Home · Tables.jl

Yeah, as I said, if the implementation on Tables.jl has column names I guess it should work. Let’s take for example the TimeSeries.jl implementation.

using TimeSeries
using Dates
using SQLdf

dates = Date(2018, 1, 1):Day(1):Date(2018, 12, 31)
ta = TimeArray(dates, rand(length(dates)))

sqldf("select * from ta limit 5")

ERROR: MethodError: no method matching sexpclass(::TimeArray...

In this case it seems TimeArray does not directly provide a name for the timestamp column and RCall fails because it cannot transform it into a data.frame and back to a TimeArray.

ta = TimeArray(dates, rand(length(dates)))
365×1 TimeArray{Float64, 1, Date...
│            │ A      │
├────────────┼────────┤
│ 2018-01-01 │ 0.4664 │
│ 2018-01-02 │ 0.0072 │
│ 2018-01-03 │ 0.5835 │
│ 2018-01-04 │ 0.5457 │

I do not know what implementations based on Tables can be converted by RCall besides DataFrames, I am just guessing that those DataFrame like may work or they should be converted to a DataFrame to work.

Back to this example, if we still want to use SQLdf we need to transform the TimeArray into a DataFrame.

tad = DataFrame(ta)
sqldf("select * from tad limit 5")

5×2 DataFrame
 Row │ timestamp   A          
     │ Date        Float64    
─────┼────────────────────────
   1 │ 2018-01-01  0.466422
   2 │ 2018-01-02  0.00715373
   3 │ 2018-01-03  0.58346
   4 │ 2018-01-04  0.545692
   5 │ 2018-01-05  0.0297381

Therefore, the general answer for Tables implementations is no, it does not work on Tables.jl implementations unless, like DataFrames, they are supported by RCall.

1 Like

The DataFrame object is pretty light-weight. If conversion to Rcall is the limiting factor, there is no shame in calling DataFrame under the hood.

But it would be totally awesome if you could get RCall to convert to data.frame in R for all Tables.jl compatible objects.

DataFrames(x::T) can convert tables types into DataFrames, that’s not a problem, the problem is converting the query DataFrame back into the original type S<:T. ( Let’s take S<:T as notation for S type implements a Table interface).

We could return a query DataFrame but if one’s working with a type S<:T one wants the query to return a type S<:T as well. Also, since I don’t know how peculiar column naming might be for some Table types I thought best to let the user do the conversion to DataFrame.

We could though implement in SQSLdf the back and forward conversion for a specific type S<:T:

S<:T → DataFrame → Query DataFrame → S<:T

However, unless S<:T happens to be a mainstream type like DataFrame, I think is reasonable to kindly ask those interested in having such integration to provide a pull request with these steps.

DataFrames(x::T) would do that, but we would not get back an S<:T. That’s probably why RCall also lets to the user to do the conversion.

I’m now imagining a complete Julia based version of this that works something like

@sqldf (a,b,c,d "select c.*,d.* from a join b on ... join c on... order by")

It loads a,b,c,d into an in-memory sqlite database, and then executes the sql command and returns the value. This seems like it’d be really easy to program.

1 Like

This might not be possible in general (e.g. some Tables.jl tables can be memory mapped files). I think you can use the same pattern as CSV.jl uses in CSV.read, when the second argument is a constructor of a target table. As I assume SQLdf.jl will have DataFrames.jl as a dependency you can have DataFrame as a default (CSV.jl does not have DataFrames.jl as a dependency so one has to explicitly pass the second argument).

1 Like

There is no need to declare a,b,c,d since they’re already present in the query, but I would also love to see a full Julia optimized version, in the meanwhile…

CSV forces the user to call the constructor function for S<:T anyway and we would be only changing this:

S.constructor(sqldf("""select ..."""), args...)

for this:

sqldf("""select ...""", S.constructor, args...)

I don’t think there is any gain, specially considering that the second option still needs to deal with potential nameless columns in the Table type that we would solve by converting first to a DataFrame.

Yes but then you have to write a parser for SQL to extract them… That’s not easy

That’s exactly what SQLdf does :slight_smile:

You are parsing the SQL? Or the R code is?

I think it’s maybe a design mistake to avoid having the user of the macro provide the data frames of interest, in favor of parsing SQL to extract the table names in use and try to find data frames in the environment that match. For one thing, if the user explicitly provides the data frames, then the user knows exactly what’s “visible” in the SQL database, which can avoid some bugs where things work when there’s a global variable named foo but break if that global isn’t defined, or is redefined to be something other than a Table or a DataFrame or whatever. Second of all, you have to maintain a SQL parser, and keep it up to date with everything that the SQL implementation knows how to parse. So maybe someone writes some queries that use pragmas or other specialized stuff that SQLite allows, and do you parse it correctly? What a pain!

I’m going to take a stab at a Julia implementation because I think this is pretty interesting and I love to use sqldf in R. But I’m going to go with an explicitly provided set of tables.

I don’t think so, thanks for your comment though.

Definitely, different strokes I guess. I am taking a stab at my version.

1 Like

I’m not sure what you mean S <: T a table is not a type, just an interface, so there is no subtyping involved.

I think most custom types that implement the Tables interface should have a constructor for Tables, so you could just try typeof(t)(out) and if it works that’s cool.

Yeah, thanks for the clarification, by S<:T I meant a type S implementing a Table interface.

One might think so but, for instance, for TimeArray that would not work, and even if it did, the user would still have to guess the name for the DateTime column. With such approach we would have the query working sometimes and failing some more or worse; working halfway.

However, I don’t see this much of a problem because if there is a really popular Table based type like DataFrame that everybody wants to use SQL queries on it, no worries, I will probably do a specific integration for that type myself. As things stand right now there are a couple of dozens packages based on Tables and is probably not reasonable for SQLdf to keep track of them all as well as future implementations.

Thanks for your comments!