A serious data start-up structured around a Julia data manipulation framework for larger-than-RAM data

I have actually started a project like this, but it is currently in a private repo. Also, back in March I pivoted to a different project called ExtendableInterfaces.jl (also in a private repo), so I haven’t been working on the table query project lately. However, I hope to release ExtendableInterfaces.jl in the next few months so that I can pivot back to the table query project.

I have a name for my table querying package, but the package is not registered yet so I’m a little hesitant to publicize the name. :joy:

Here is a summary of the high-level principles and goals of the project:

  • Lazy, declarative queries.
  • Queries are optimized by a query compiler.
  • Input tables are not mutated.
  • Works on any table implementing the interface in Tables.jl.
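
To make those principles concrete, here is a toy sketch in plain Python. It is not the API of the package discussed here: the node names (`Scan`, `Filter`, `Select`) and the single rewrite rule in `optimize` are hypothetical, invented for illustration. It shows a query built lazily as a plan, rewritten by a tiny "compiler" pass, and then executed without mutating the input table.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical plan nodes, invented for this sketch.
@dataclass(frozen=True)
class Scan:
    rows: tuple  # tuple of dicts, one dict per row

@dataclass(frozen=True)
class Filter:
    child: object
    predicate: Callable[[dict], bool]

@dataclass(frozen=True)
class Select:
    child: object
    columns: tuple

def optimize(plan):
    """Toy rewrite rule: merge directly nested Filter nodes into one."""
    if isinstance(plan, Filter):
        child = optimize(plan.child)
        if isinstance(child, Filter):
            outer, inner = plan.predicate, child.predicate
            return Filter(child.child, lambda row: inner(row) and outer(row))
        return Filter(child, plan.predicate)
    if isinstance(plan, Select):
        return Select(optimize(plan.child), plan.columns)
    return plan

def execute(plan):
    """Evaluate a plan into a new list of row dicts; the input is not mutated."""
    if isinstance(plan, Scan):
        return [dict(row) for row in plan.rows]
    if isinstance(plan, Filter):
        return [row for row in execute(plan.child) if plan.predicate(row)]
    if isinstance(plan, Select):
        return [{c: row[c] for c in plan.columns} for row in execute(plan.child)]
    raise TypeError(f"unknown plan node: {plan!r}")

# Building the query does no work; it only describes what to compute.
table = ({"a": 1, "b": "x"}, {"a": 2, "b": "y"}, {"a": 3, "b": "z"})
query = Select(
    Filter(Filter(Scan(table), lambda r: r["a"] > 1), lambda r: r["a"] < 3),
    ("b",),
)
optimized = optimize(query)
result = execute(optimized)
```

A real query compiler would have many more rules (predicate pushdown, projection pruning, join reordering), which is where most of the difficulty lies.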

Additional goals of the project:

  • Execute queries on larger-than-memory data sets.
  • Distributed (big data) processing.
    • Longer term goal. Possibly with help from Dagger.jl.
  • Translate queries to SQL and send to databases?
    • This is not a top priority for me, but it could be done.

Additional details:

  • Queries are written in terms of relational algebra operators.
    • This differs from Polars, which also has the concept of column expressions. I don’t like those because they let the user accidentally scramble the relationships between values in the same row. For example, this Polars query would mess up your data:
    # each column is sorted independently, so rows no longer line up
    df.select(
        pl.col("a").sort(),
        pl.col("b").sort()
    )
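
The effect can be reproduced without Polars. The following plain-Python sketch uses a made-up two-column table to contrast sorting each column on its own with sorting whole rows, which is what a relational operator would do.

```python
# Toy table stored column-wise; the values are made up for illustration.
table = {"a": [3, 1, 2], "b": ["x", "y", "z"]}

# Column-wise: each column is sorted on its own, so the pairings are lost.
scrambled = {name: sorted(values) for name, values in table.items()}

# Row-wise (relational): sort whole rows by "a", keeping each pair together.
rows = list(zip(table["a"], table["b"]))
sorted_rows = sorted(rows, key=lambda row: row[0])

# Compare which (a, b) pairs survive in each case.
original_pairs = set(rows)
scrambled_pairs = set(zip(scrambled["a"], scrambled["b"]))
```

Here `sorted_rows` contains exactly the original `(a, b)` pairs in a new order, while `scrambled_pairs` contains pairs that never existed in the input.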
    

I am definitely open to contributors; however, for the initial release I want to retain tight control over the API and semantics. I have a clear vision for the project and I don’t want to spend months trying to build consensus in the community on the right API and semantics. The API will be the API that I want. :slight_smile:

Anyhow, implementing the core API is the easy part. The hard part is writing the query compiler.

Regarding commercialization: Polars currently does not have a very compelling story, in my opinion. Their new company only advertises services like private consultations and priority on critical bugs. A Spark replacement in Julia would have business value, but I’m not sure exactly how it would be commercialized. I’m not a business person.
