A serious data start-up structured around a Julia data manipulation framework for larger-than-RAM data

Still feeling a bit lost in life due to a mid-life crisis. So, another idea.

Polars in Python is very interesting: it has grown very fast and can handle large datasets, BUT it's written in Rust.

That means people wanting to extend it have a much higher technical bar to clear. I am the author of disk.frame in R, which was a moderately successful package for working with larger-than-RAM data.

I think tech like COBOL and SAS is still around because it CAN handle larger-than-RAM data. It's just slow, row-by-row processing.

I wanted to resurrect the DiskFrame brand in Julia and build a serious Polars competitor.

The idea is that one can easily transition from Python to Julia, so the sell is a fast and extensible tool targeted at the higher end of analytics, where people need to move big data around but don't quite need a large Spark cluster.

The amount of work is insane. So I need to save up enough to “retire” before I can do it, but I think it would be a really fun challenge to work on.

9 Likes

Sounds like an interesting project!

In principle there are funding opportunities for Julia projects that could benefit a lot of people, I think. But I don't really know any specifics; perhaps @ChrisRackauckas can comment on that.

1 Like

I’m not sure what you’re referring to. If you’re a university professor who has been at the top of the field then you can apply for government research grants and tie that to open source development. You can get industry development grants (SBIR/STTR in the US) though you’ll need a good commercialization story to go along with it.

1 Like

I work in industry. Either I get to retire from my current job so I can focus on it or get some investment to work on it. But there needs to be a commercial story here.

What, specifically, does this thing in principle enable?

So, for example, I work with Bayesian models. In general my problem isn't data too big to fit in RAM; it's parameter spaces too big to sample efficiently. In some cases I subsample data just so I can make the sampler go faster and be less constrained. Sometimes having 100M data points and finding out that some parameter is, say, 2.80492 ± 0.00002 after 1 month of computing isn't of interest compared to getting 2.80 ± 0.01 in 3 hours.

I'm not saying there's no use case; I'm saying I don't understand the value proposition. This is from someone who's currently working on a model involving something like 500,000 rows of data from the Census, and who has worked with the complete ACS microdata in the past (maybe 100M rows in two tables, household and person).

I have actually started a project like this, but it is currently in a private repo. Also, back in March I pivoted to a different project called ExtendableInterfaces.jl (also in a private repo), so I haven’t been working on the table query project lately. However, I hope to release ExtendableInterfaces.jl in the next few months so that I can pivot back to the table query project.

I have a name for my table querying package, but the package is not registered yet so I’m a little hesitant to publicize the name. :joy:

Here is a summary of the high-level principles and goals of the project (a short Polars sketch after the lists illustrates an analogous workflow):

  • Lazy, declarative queries.
  • Queries are optimized by a query compiler.
  • Input tables are not mutated.
  • Works on any table implementing the interface in Tables.jl.

Additional goals of the project:

  • Execute queries on larger-than-memory data sets.
  • Distributed (big data) processing.
    • Longer term goal. Possibly with help from Dagger.jl.
  • Translate queries to SQL and send to databases?
    • This is not a top priority for me, but it could be done.
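
For reference, here is roughly what these principles look like in Polars today. This is purely illustrative, with a hypothetical data.parquet file and made-up column names; it is not the API of the package described here.

import polars as pl

# Build the query lazily; nothing executes yet and the input table is never mutated.
lf = (
    pl.scan_parquet("data.parquet")      # hypothetical file
    .filter(pl.col("trip") > 10)
    .group_by("serial_number")
    .agg(pl.col("speed").mean())
)

# The query compiler rewrites the plan (projection/predicate pushdown, etc.)
# before anything runs.
print(lf.explain())

# Materialize the result as a new table.
df = lf.collect()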

Additional details:

  • Queries are written in terms of relational algebra operators.
    • This differs from Polars, which also has the concept of column expressions. I don't like column expressions because they allow the user to accidentally scramble their data relations. For example, this Polars query would mess up your data:
    df.select(
        pl.col("a").sort(),
        pl.col("b").sort()
    )
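
    To spell out the problem: each column expression above is sorted independently, so values that started in the same row no longer line up afterwards. A row-preserving sort keeps whole rows together (illustrative Polars, with made-up data):

    import polars as pl

    df = pl.DataFrame({"a": [3, 1, 2], "b": ["x", "y", "z"]})

    # Scrambled: "a" and "b" are each sorted on their own,
    # so the original (a, b) pairs are destroyed.
    scrambled = df.select(pl.col("a").sort(), pl.col("b").sort())

    # Row-preserving: whole rows are reordered together.
    sorted_df = df.sort("a")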
    

I am definitely open to contributors; however, for the initial release I want to retain tight control over the API and semantics. I have a clear vision for the project, and I don't want to spend months trying to develop consensus among the community on the right API and semantics. The API will be the API that I want. :slight_smile:

Anyhow, implementing the core API is the easy part. The hard part is writing the query compiler.
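
To make "the hard part" a bit more concrete, here is a toy Python sketch (not the author's design; the names Scan, Filter, Project and push_filter_below_project are made up) of the kind of rewrite rule a query compiler has to get right. A real compiler needs many such rules, plus correctness conditions and a cost model for each.

from dataclasses import dataclass

@dataclass
class Scan:
    table: str
    columns: list

@dataclass
class Filter:
    predicate_columns: list   # columns the predicate reads
    predicate: object         # opaque predicate function
    child: object

@dataclass
class Project:
    columns: list
    child: object

def push_filter_below_project(plan):
    # Rewrite Filter(Project(x)) -> Project(Filter(x)) when the predicate
    # only reads columns that the projection keeps.
    if (
        isinstance(plan, Filter)
        and isinstance(plan.child, Project)
        and set(plan.predicate_columns) <= set(plan.child.columns)
    ):
        proj = plan.child
        return Project(proj.columns, Filter(plan.predicate_columns, plan.predicate, proj.child))
    return plan

plan = Filter(["a"], lambda row: row["a"] > 0, Project(["a", "b"], Scan("t", ["a", "b", "c"])))
print(push_filter_below_project(plan))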

Regarding the commercialization story: Polars currently does not have a very good commercialization story. Their new company only advertises services like private consultations and priority on critical bugs, which is not a very compelling commercialization story in my opinion. A Spark replacement in Julia would have business value, but I’m not sure exactly how that would be commercialized. I’m not a business person.

4 Likes

This is actually very similar in scope to what I have in mind.

Push for the bigger-than-RAM data angle, because for data that fits in RAM Pandas can already do the job.

Also put some solid ML algorithms in there, and then we have a product.

1 Like

Yeah, I think a Spark replacement probably has more commercial opportunity than a Pandas/Polars replacement. As you mentioned, adding in distributed ML algorithms can help. Spark has libraries for both distributed ML and distributed graph algorithms.

That being said, I want the library to be free and open-source, so I’m not sure exactly how commercialization would work. Some kind of cloud-computing services? Integration with JuliaHub? :joy:

2 Likes

Would it be helpful, or even wanted, if it were non-free (assumed, since this is a start-up), given that there is already GitHub - Pangoraw/Polars.jl: 🐻 Julia wrapper around the polars library (the expected wrapper for the Rust code, not to be confused with Polar.jl)? With that wrapper you are likely limited to data types equivalent to Rust's, so not really limited?

Well, that way neither Julia nor any other language(?) is limited. And there's also Dagger.jl. Do you want to compete with Spark (which is free?) rather than with Polars?

1 Like

I think Spark UDFs are written in either Scala or Python. Python is slow, and Scala is a niche programming language, so not many data scientists will adopt it. Julia is fast and easy to pick up, so extending the system with Julia is the key. Julia is also niche, but it could go big.

2 Likes

Yes, fast user-defined functions are one of the advantages that a Julia solution would have over Polars or Spark.
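
As a small illustration (not a benchmark, and the function name is made up): a non-trivial UDF in Polars has to call back into the Python interpreter for every element, which is exactly the overhead a Julia-native engine could avoid, since Julia UDFs compile to native code.

import polars as pl

df = pl.DataFrame({"a": [1.0, 2.0, 3.0]})

def my_custom_logic(x: float) -> float:
    # Stand-in for logic with no built-in expression equivalent.
    return x ** 0.5 + 1.0

# Each element triggers a Python call, which is slow on large data.
out = df.with_columns(
    pl.col("a").map_elements(my_custom_logic, return_dtype=pl.Float64).alias("b")
)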

How would any of this differ from DTables.jl? Aside from the argument re-arranging you implement.

First, I don't want this to be distributed. An EC2 instance can have something like 2 TB of RAM. If you still need distribution beyond that, then it's not my use case.

Also, DTables.jl, how do I say it, doesn't look accessible. The examples on the front page don't really make much sense for the average data scientist.

Are there any requirements/goals that aren’t addressed by judicious use of DuckDB or SQLite?

1 Like

UDFs and incorporating arbitrary Julia code as part of the processing. E.g., DuckDB can be used in conjunction with it.

1 Like

DTables.jl doesn’t do query optimization. The current map and filter API in DTables.jl is not very conducive to query optimization. A lazy map (i.e. select) or filter operator needs to know exactly which columns are being operated on in order to enable various relational algebra expression rewrites. But with the current API, the columns that are operated on are hidden inside the opaque f that is passed to map or filter. Taking a row and returning a row in the map function also makes query optimization more challenging. Overall, it does not seem like DTables.jl was designed with query optimization in mind.
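
Polars illustrates the same point from the other side. Here is a small illustrative sketch (with a hypothetical data.csv) showing how a declarative column reference enables projection pushdown, while an opaque row-wise closure forces the engine to read every column:

import polars as pl

lf = pl.scan_csv("data.csv")  # hypothetical file with columns a, b, c, ...

# Declarative: the planner sees that only column "a" is used and can
# push that projection down into the scan.
print(lf.select(pl.col("a") + 1).explain())

# Opaque row-wise closure: the closure's body is invisible to the planner,
# so every column has to be passed in via a struct and nothing can be pruned.
print(
    lf.select(
        pl.struct(pl.all())
        .map_elements(lambda row: row["a"] + 1, return_dtype=pl.Int64)
        .alias("a_plus_1")
    ).explain()
)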

The package I am developing is primarily targeted at working with in-memory data, and secondarily targeted at working with larger-than-memory data. Distributed data is a distant third.

OK. Larger-than-RAM data is my focus.

Also JuliaDB.jl: nice idea, but it died. No one wanted to pay for it.

Well, I’m implementing support for queries on in-memory data first, but support for queries on larger-than-memory data is an important feature that I definitely plan to implement. A tool like this should support optimized queries for both in-memory and larger-than-memory data, like Polars does.
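
For reference, the larger-than-memory path in Polars today looks roughly like this (illustrative sketch with a hypothetical file and made-up columns): the lazy plan is executed by the streaming engine in batches rather than loading the whole file into RAM.

import polars as pl

result = (
    pl.scan_parquet("events.parquet")   # hypothetical larger-than-RAM file
    .filter(pl.col("status") == "ok")
    .group_by("device_id")
    .agg(pl.col("value").sum())
    .collect(streaming=True)            # process in batches instead of all at once
)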

In addition to fast user-defined functions, as @xiaodai mentioned, there are at least two other advantages to being able to write queries in Julia. I use Polars at work, so I will provide examples in Polars, but the same could be done with Julia queries.

Code reuse

Code reuse in SQL is difficult, but it is easy in Polars. Here’s an example:

import polars as pl
from polars import col

def shift_over(col_name, n):
    # Shift a column by n rows within each (serial_number, trip) group,
    # ordered by timestamp.
    return (
        col(col_name)
        .shift(n)
        .over(
            partition_by = ["serial_number", "trip"],
            order_by = "timestamp"
        )
    )

df2 = df.with_columns(
    x_lag_1  = shift_over("a", 1),
    x_lead_1 = shift_over("a", -1),
    y_lag_1  = shift_over("b", 1),
    y_lead_1 = shift_over("b", -1)
)

Programmatically generate columns

It’s easy to programmatically generate new columns in Polars. Not so easy in SQL. Here’s an example in Polars that has both code reuse and programmatically generated columns:

def forward_circular_shift(col_name, n):
    # Circularly shift the column forward by n rows: the last n values wrap
    # around to the front, and the result is named "<col_name>_lag_<n>".
    return (
        col(col_name)
        .tail(n)
        .append(
            col(col_name).head(pl.len() - n)
        )
        .alias(col_name + f"_lag_{n}")
    )

df2 = (
    df
    .with_columns(
        [
            forward_circular_shift(col_name, n)
            for col_name in ["a", "b", "c", "d", "e"]
            for n in [1, 2, 3, 4, 5]
        ]
    )
)