I have actually started a project like this, but it is currently in a private repo. Also, back in March I pivoted to a different project called ExtendableInterfaces.jl (also in a private repo), so I haven’t been working on the table query project lately. However, I hope to release ExtendableInterfaces.jl in the next few months so that I can pivot back to the table query project.
I have a name for my table querying package, but the package is not registered yet so I’m a little hesitant to publicize the name.
Here is a summary of the high-level principles and goals of the project:
- Lazy, declarative queries.
- Queries are optimized by a query compiler.
- Input tables are not mutated.
- Works on any table implementing the interface in Tables.jl.
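To make the first three principles concrete, here is a minimal Python sketch of a lazy query plan with one toy optimizer rule. It is not the package's API, which is unpublished: the names `Scan`, `Filter`, `Project`, `optimize`, and `execute` are all hypothetical, and the single rewrite rule (fusing adjacent filters) only stands in for what a real query compiler would do.

```python
from dataclasses import dataclass
from typing import Callable, Union

# All names here are hypothetical; this is an illustration, not a real API.

@dataclass(frozen=True)
class Scan:
    rows: tuple  # source table as an immutable tuple of dict rows

@dataclass(frozen=True)
class Filter:
    child: "Plan"
    predicate: Callable[[dict], bool]

@dataclass(frozen=True)
class Project:
    child: "Plan"
    columns: tuple

Plan = Union[Scan, Filter, Project]

def optimize(plan):
    """Toy rewrite rule: fuse adjacent Filter nodes into a single Filter."""
    if isinstance(plan, Filter):
        child = optimize(plan.child)
        if isinstance(child, Filter):
            outer, inner = plan.predicate, child.predicate
            return Filter(child.child, lambda r: inner(r) and outer(r))
        return Filter(child, plan.predicate)
    if isinstance(plan, Project):
        return Project(optimize(plan.child), plan.columns)
    return plan

def execute(plan):
    """Run a plan and return new rows; the source table is never modified."""
    if isinstance(plan, Scan):
        return [dict(r) for r in plan.rows]  # copy each row
    if isinstance(plan, Filter):
        return [r for r in execute(plan.child) if plan.predicate(r)]
    if isinstance(plan, Project):
        return [{c: r[c] for c in plan.columns} for r in execute(plan.child)]
    raise TypeError(f"unknown plan node: {plan!r}")

table = (
    {"name": "a", "x": 1},
    {"name": "b", "x": 5},
    {"name": "c", "x": 9},
)

# Building the query does no work; it only describes what to compute.
query = Project(
    Filter(Filter(Scan(table), lambda r: r["x"] > 2), lambda r: r["x"] < 8),
    ("name",),
)
optimized = optimize(query)
result = execute(optimized)
print(result)
```

Because the query is just a data structure until `execute` is called, an optimizer is free to rewrite it first, and the input table is only ever read.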
Additional goals of the project:
- Execute queries on larger-than-memory data sets.
- Distributed (big data) processing.
  - Longer term goal. Possibly with help from Dagger.jl.
- Translate queries to SQL and send to databases?
  - This is not a top priority for me, but it could be done.
Additional details:
- Queries are written in terms of relational algebra operators.
  - This differs from Polars, which also has the concept of column expressions. I don’t like those because they let the user accidentally scramble the relationships between columns. For example, this Polars query sorts each column independently, so values that started in the same row no longer line up:
    `df.select(pl.col("a").sort(), pl.col("b").sort())`
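The scrambling problem can be reproduced without Polars. The sketch below uses plain Python tuples as rows (the data is made up for illustration) and contrasts sorting each column on its own with sorting whole rows by a key, which is what a relational sort operator does.

```python
# Each tuple is one row: (a, b). The values are invented for illustration.
rows = [(3, "c"), (1, "z"), (2, "a")]

# Column-expression style: sort each column independently, then zip them back.
col_a = sorted(r[0] for r in rows)
col_b = sorted(r[1] for r in rows)
scrambled = list(zip(col_a, col_b))

# Relational style: sort whole rows by column a, keeping each row intact.
sorted_rows = sorted(rows, key=lambda r: r[0])

print(scrambled)    # pairs that never existed in the input
print(sorted_rows)  # the original rows in a new order
```

In the first result every `(a, b)` pair is new, so the relation between the two columns is lost; in the second the rows are merely reordered.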
I am definitely open to contributors; however, for the initial release I want to retain tight control over the API and semantics. I have a clear vision for the project and I don’t want to spend months trying to develop consensus among the community on the right API and semantics. The API will be the API that I want.
Anyhow, implementing the core API is the easy part. The hard part is writing the query compiler.
Regarding the commercialization story: Polars currently does not have a very good one. Their new company only advertises services like private consultations and priority on critical bugs, which is not very compelling in my opinion. A Spark replacement in Julia would have business value, but I’m not sure exactly how that would be commercialized. I’m not a business person.