A serious data start-up structured around a Julia data manipulation framework for larger-than-RAM data

Still feeling a bit lost in life due to a mid-life crisis, so here’s another idea.

Polars in Python is very interesting: it has grown very fast and can handle large datasets, BUT it’s written in Rust.

That means people wanting to extend it have a much higher technical bar to clear. I am the author of disk.frame in R, which was a moderately successful package for working with larger-than-RAM data.

I think tech like COBOL and SAS is still around because it CAN handle larger-than-RAM data. It’s just slow, row-by-row processing.

I want to resurrect the DiskFrame brand in Julia and build a serious Polars competitor.

The idea is that one can easily transition from Python to Julia, so the sell is a fast, extensible tool targeted at the higher end of analytics, where people need to move big data around but don’t quite need a large Spark cluster.

The amount of work is insane. So I need to save up enough to “retire” before I can do it, but I think it would be a really fun challenge to work on.

11 Likes

Sounds like an interesting project!

In principle, I think there are funding opportunities for Julia projects that could benefit a lot of people. But I don’t really know any specifics; perhaps @ChrisRackauckas can comment on that.

1 Like

I’m not sure what you’re referring to. If you’re a university professor who has been at the top of the field then you can apply for government research grants and tie that to open source development. You can get industry development grants (SBIR/STTR in the US) though you’ll need a good commercialization story to go along with it.

1 Like

I work in industry. Either I retire from my current job so I can focus on it, or I get some investment to work on it. But there needs to be a commercial story here.

What, specifically, does this thing in principle enable?

So for example, I work with Bayesian models. In general my problem isn’t data too big to fit in RAM, it’s parameter spaces too big to sample efficiently. In some cases I subsample data just so I can make the sampler go faster and be less constrained. Sometimes having 100M data points and finding out that some parameter is, say, 2.80492 ± 0.00002 after 1 month of computing isn’t of interest compared to 2.80 ± 0.01 in 3 hrs.

I’m not saying there’s no use case; I’m saying I don’t understand the value proposition. This is from someone who’s currently working on a model involving something like 500,000 rows of data from the Census, and who has worked on the complete ACS microdata in the past (maybe 100M rows in two tables, household and person).

I have actually started a project like this, but it is currently in a private repo. Also, back in March I pivoted to a different project called ExtendableInterfaces.jl (also in a private repo), so I haven’t been working on the table query project lately. However, I hope to release ExtendableInterfaces.jl in the next few months so that I can pivot back to the table query project.

I have a name for my table querying package, but the package is not registered yet so I’m a little hesitant to publicize the name. :joy:

Here is a summary of the high-level principles and goals of the project:

  • Lazy, declarative queries.
  • Queries are optimized by a query compiler.
  • Input tables are not mutated.
  • Works on any table implementing the interface in Tables.jl.

Additional goals of the project:

  • Execute queries on larger-than-memory data sets.
  • Distributed (big data) processing.
    • Longer term goal. Possibly with help from Dagger.jl.
  • Translate queries to SQL and send to databases?
    • This is not a top priority for me, but it could be done.
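
For comparison, here is roughly what the first two principles above (lazy, declarative queries optimized by a query compiler, without mutating the input) look like in Polars’ Python API. This is just an illustrative sketch; the file name is a placeholder.

import polars as pl

# Build a lazy query. Nothing is executed yet, and the source data is never
# mutated; "sensor_data.parquet" is only a placeholder file name.
lazy_query = (
    pl.scan_parquet("sensor_data.parquet")
    .filter(pl.col("trip") > 0)
    .select(["serial_number", "trip", "timestamp"])
)

# The query optimizer runs at collect() time (predicate and projection
# pushdown, etc.), and only then is a new, independent DataFrame materialized.
result = lazy_query.collect()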

Additional details:

  • Queries are written in terms of relational algebra operators.
    • This differs from Polars, which also has the concept of column expressions. I don’t like column expressions because they allow the user to accidentally scramble their data relations. For example, this Polars query would mess up your data:
    df.select(
        pl.col("a").sort(),
        pl.col("b").sort()
    )
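
To make the scrambling concrete, here is a minimal sketch with made-up data:

import polars as pl

df = pl.DataFrame({"a": [3, 1, 2], "b": ["x", "y", "z"]})

# Sorting each column independently breaks the (a, b) pairing: the original
# rows were (3, "x"), (1, "y"), (2, "z"), but here a becomes [1, 2, 3] while
# b stays ["x", "y", "z"], so the rows no longer correspond to real records.
scrambled = df.select(pl.col("a").sort(), pl.col("b").sort())

# A relational operator keeps rows intact: sorting the whole table by "a"
# yields (1, "y"), (2, "z"), (3, "x").
sorted_rows = df.sort("a")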
    

I am definitely open to contributors; however, for the initial release I want to retain tight control over the API and semantics. I have a clear vision for the project and I don’t want to spend months trying to develop consensus among the community on the right API and semantics. The API will be the API that I want. :slight_smile:

Anyhow, implementing the core API is the easy part. The hard part is writing the query compiler.

Regarding the commercialization story: Polars currently does not have a very good commercialization story. Their new company only advertises services like private consultations and priority on critical bugs, which is not a very compelling commercialization story in my opinion. A Spark replacement in Julia would have business value, but I’m not sure exactly how that would be commercialized. I’m not a business person.

4 Likes

This is actually very similar in scope to what I have in mind.

Push for the larger-than-RAM data, because otherwise Pandas can do that too.

Also, put some solid ML algorithms in there and then we have a product.

1 Like

Yeah, I think a Spark replacement probably has more commercial opportunity than a Pandas/Polars replacement. As you mentioned, adding in distributed ML algorithms can help. Spark has libraries for both distributed ML and distributed graph algorithms.

That being said, I want the library to be free and open-source, so I’m not sure exactly how commercialization would work. Some kind of cloud-computing services? Integration with JuliaHub? :joy:

2 Likes

Would it be helpful, or even wanted, if it were non-free (assumed, since it’s a start-up)? There is already GitHub - Pangoraw/Polars.jl: 🐻 Julia wrapper around the polars library (the expected wrapper for the Rust code, not to be confused with Polar.jl). With the wrapper you are likely limited to data types equivalent to Rust’s, so not really limited?

Well, that way Julia is not limited, nor is any language(?). And there’s also Dagger.jl. Would you rather compete with Spark (which is free?) than with Polars?

1 Like

I think Spark UDFs are written in either Scala or Python. Python is slow and Scala is a niche programming language, so not many data scientists will adopt it. Julia is fast and easy to pick up, so extending the system with Julia is the key. It’s also niche, but it could go big.

2 Likes

Yes, fast user-defined functions are one of the advantages that a Julia solution would have over Polars or Spark.
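
For example, Python UDFs in Polars fall back to calling a Python function once per value, which is exactly the overhead a compiled Julia UDF could avoid. A rough sketch (column name and function are made up):

import polars as pl

df = pl.DataFrame({"reading": [0.1, 0.5, 0.9]})

# map_elements runs the lambda in the Python interpreter for every element,
# so it is far slower than Polars' native Rust expressions. A Julia engine
# could compile user functions to native code instead.
df = df.with_columns(
    adjusted = pl.col("reading").map_elements(
        lambda x: x ** 2 + 1, return_dtype=pl.Float64
    )
)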

How would any of this differ from DTables.jl, aside from the argument rearranging you implement?

First, I don’t want this to be distributed. An EC2 instance can have something like 2 TB of RAM. If you still need distribution beyond that, then it’s not my use case.

Also, DTables.jl, how do I say it, doesn’t look accessible. The examples on the front page don’t really make much sense to the average data scientist.

Are there any requirements/goals that aren’t addressed by judicious use of DuckDB or SQLite?

1 Like

UDFs and incorporating arbitrary Julia code as part of the processing. E.g., DuckDB can be used in conjunction with it.

1 Like

DTables.jl doesn’t do query optimization. The current map and filter API in DTables.jl is not very conducive to query optimization. A lazy map (i.e. select) or filter operator needs to know exactly which columns are being operated on in order to enable various relational algebra expression rewrites. But with the current API, the columns that are operated on are hidden inside the opaque f that is passed to map or filter. Taking a row and returning a row in the map function also makes query optimization more challenging. Overall, it does not seem like DTables.jl was designed with query optimization in mind.
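
To illustrate why column visibility matters, here is a small Polars sketch (placeholder file and column names). Because the query itself names the columns, the optimizer can push the filter and the projection down into the scan, and the optimized plan can be inspected with explain():

import polars as pl

plan = (
    pl.scan_parquet("events.parquet")   # placeholder file name
    .filter(pl.col("status") == "ok")   # the optimizer sees column "status"
    .select(["user_id", "status"])      # and that only two columns are needed
)

# Prints the optimized logical plan, showing predicate and projection pushdown
# into the scan. An opaque row-wise closure would hide this information.
print(plan.explain())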

The package I am developing is primarily targeted at working with in-memory data, and secondarily targeted at working with larger-than-memory data. Distributed data is a distant third.

OK. Larger-than-RAM data is my focus.

Also, JuliaDB.jl: nice idea, but it died. No one wanted to pay for it.

Well, I’m implementing support for queries on in-memory data first, but support for queries on larger-than-memory data is an important feature that I definitely plan to implement. A tool like this should support optimized queries for both in-memory and larger-than-memory data, like Polars does.
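
As a rough sketch of how the larger-than-memory case looks in Polars (placeholder file and column names): a lazy scan combined with a sink streams the data through in chunks, so the full table is never materialized in RAM.

import polars as pl

# Lazily scan a file that may not fit in memory, aggregate it, and stream the
# result straight back to disk with the streaming engine.
(
    pl.scan_csv("huge_input.csv")
    .filter(pl.col("amount") > 0)
    .group_by("customer_id")
    .agg(pl.col("amount").sum())
    .sink_parquet("aggregated.parquet")
)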

In addition to fast user-defined functions, as @xiaodai mentioned, there are at least two other advantages to being able to write queries in Julia. I use Polars at work, so I will provide examples in Polars, but the same could be done with Julia queries.

Code reuse

Code reuse in SQL is difficult, but it is easy in Polars. Here’s an example:

import polars as pl
from polars import col

def shift_over(col_name, n):
    # Shift the column by n rows within each (serial_number, trip) group,
    # ordered by timestamp (n > 0 lags, n < 0 leads).
    return (
        col(col_name)
        .shift(n)
        .over(
            partition_by = ["serial_number", "trip"],
            order_by = "timestamp"
        )
    )

df2 = df.with_columns(
    x_lag_1  = shift_over("a", 1),
    x_lead_1 = shift_over("a", -1),
    y_lag_1  = shift_over("b", 1),
    y_lead_1 = shift_over("b", -1)
)

Programmatically generate columns

It’s easy to programmatically generate new columns in Polars. Not so easy in SQL. Here’s an example in Polars that has both code reuse and programmatically generated columns:

def forward_circular_shift(col_name, n):
    # Circularly shift the column forward by n rows: the last n values wrap
    # around to the front; the result is named <col_name>_lag_<n>.
    return (
        col(col_name)
        .tail(n)
        .append(
            col(col_name).head(pl.len() - n)
        )
        .alias(col_name + f"_lag_{n}")
    )

df2 = (
    df
    .with_columns(
        [
            forward_circular_shift(col_name, n)
            for col_name in ["a", "b", "c", "d", "e"]
            for n in [1, 2, 3, 4, 5]
        ]
    )
)