A serious data start-up structured around a Julia data manipulation framework for larger-than-RAM data

CameronBieganek · September 15, 2024, 3:07pm

I have actually started a project like this, but it is currently in a private repo. Also, back in March I pivoted to a different project called ExtendableInterfaces.jl (also in a private repo), so I haven’t been working on the table query project lately. However, I hope to release ExtendableInterfaces.jl in the next few months so that I can pivot back to the table query project.

I have a name for my table querying package, but the package is not registered yet so I’m a little hesitant to publicize the name.

Here is a summary of the high-level principles and goals of the project:

- Lazy, declarative queries.
- Queries are optimized by a query compiler.
- Input tables are not mutated.
- Works on any table implementing the interface in Tables.jl.
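
To make the first three principles concrete, here is a minimal sketch in Python (the language of the Polars snippet further down, not Julia). Everything in it is hypothetical and invented for illustration: the `Scan`, `Filter`, and `Project` nodes, the `optimize` rewrite rule, and the `execute` function are not the API of the unreleased package. It only shows the general shape of a lazy plan that is rewritten before it runs and never touches its input.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical plan nodes: building them does no work (lazy, declarative).
@dataclass(frozen=True)
class Scan:
    rows: tuple  # source table as a tuple of dict rows; never mutated

@dataclass(frozen=True)
class Filter:
    child: object
    pred: Callable[[dict], bool]

@dataclass(frozen=True)
class Project:
    child: object
    cols: tuple

def optimize(plan):
    """Toy 'query compiler' with one rule: fuse adjacent Filter nodes."""
    if isinstance(plan, Filter):
        child = optimize(plan.child)
        if isinstance(child, Filter):
            inner, outer = child.pred, plan.pred
            return Filter(child.child, lambda r: inner(r) and outer(r))
        return Filter(child, plan.pred)
    if isinstance(plan, Project):
        return Project(optimize(plan.child), plan.cols)
    return plan

def execute(plan):
    """Run a plan, yielding fresh row dicts so the input stays untouched."""
    if isinstance(plan, Scan):
        for row in plan.rows:
            yield dict(row)  # copy each row before handing it on
    elif isinstance(plan, Filter):
        for row in execute(plan.child):
            if plan.pred(row):
                yield row
    elif isinstance(plan, Project):
        for row in execute(plan.child):
            yield {c: row[c] for c in plan.cols}

# Illustrative data and query.
table = ({"id": 1, "x": 10}, {"id": 2, "x": 25}, {"id": 3, "x": 40})
query = Project(
    Filter(Filter(Scan(table), lambda r: r["x"] > 5), lambda r: r["x"] < 30),
    ("id",),
)
optimized = optimize(query)        # two filters become one
result = list(execute(optimized))  # work happens only here
```

A real query compiler would have many more rules (predicate pushdown, projection pruning, join reordering), which is where the hard part of such a project lies.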

Additional goals of the project:

- Execute queries on larger-than-memory data sets.
- Distributed (big data) processing.
  - Longer-term goal, possibly with help from Dagger.jl.
- Translate queries to SQL and send to databases?
  - This is not a top priority for me, but it could be done.
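
As a rough illustration of what executing a query on larger-than-memory data involves, the sketch below streams a grouped sum over fixed-size chunks so that only one chunk plus the running totals is held in memory at a time. It is plain Python with made-up names (`read_chunks`, `grouped_sum`) and a generator standing in for a large file. It is an assumption about the general technique, not a description of how the package does it.

```python
from collections import defaultdict

def read_chunks(rows, chunk_size):
    """Stand-in for reading a large file piece by piece."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter chunk
        yield chunk

def grouped_sum(chunks, key, value):
    """Sum `value` per `key`, keeping only the running totals in memory."""
    totals = defaultdict(int)
    for chunk in chunks:
        for row in chunk:
            totals[row[key]] += row[value]
    return dict(totals)

# Illustrative source: a generator, so the full data set is never materialized.
source = ({"k": i % 3, "v": i} for i in range(10))
totals = grouped_sum(read_chunks(source, 4), "k", "v")
```

Aggregations like this stream naturally. Operations such as sorts and joins need spilling or partitioning strategies, which is part of what a query compiler has to plan for.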

Additional details:

- Queries are written in terms of relational algebra operators.
  - This differs from Polars, which also has the concept of column expressions. I don’t like column expressions because they let the user accidentally scramble the relationships between columns. For example, this Polars query sorts each column independently, so values that started in the same row no longer line up:
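```
import polars as pl
df = pl.DataFrame({"a": [3, 1, 2], "b": ["x", "z", "y"]})  # illustrative data
# Each column is sorted on its own, so the original (a, b) pairings are lost.
df.select(
    pl.col("a").sort(),
    pl.col("b").sort()
)
```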
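
The same effect can be reproduced without Polars. The following plain-Python sketch, using made-up sample rows, contrasts sorting each column separately with a relational sort that moves whole rows:

```python
# Made-up sample rows; each dict is one row of the table.
rows = [{"a": 3, "b": "x"}, {"a": 1, "b": "z"}, {"a": 2, "b": "y"}]

# Column-expression style: sort each column on its own, then zip them back.
scrambled = [
    {"a": a, "b": b}
    for a, b in zip(sorted(r["a"] for r in rows), sorted(r["b"] for r in rows))
]

# Relational style: sort whole rows by column "a", keeping each row intact.
relational = sorted(rows, key=lambda r: r["a"])

def pairs(table):
    """Set of (a, b) pairings present in a table."""
    return {(r["a"], r["b"]) for r in table}

print(pairs(scrambled) == pairs(rows))   # False: pairings were invented
print(pairs(relational) == pairs(rows))  # True: only the row order changed
```

An API built only from relational operators rules out the first result by construction, because no operator can reorder one column without the others.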

I am definitely open to contributors; however, for the initial release I want to retain tight control over the API and semantics. I have a clear vision for the project, and I don’t want to spend months trying to build community consensus on the right API and semantics. The API will be the API that I want.

Anyhow, implementing the core API is the easy part. The hard part is writing the query compiler.

Regarding commercialization: Polars currently does not have a very good story here. Their new company only advertises services like private consultations and priority on critical bugs, which is not very compelling in my opinion. A Spark replacement in Julia would have business value, but I’m not sure exactly how it would be commercialized. I’m not a business person.
