A serious data start-up structured around a Julia data manipulation framework for larger-than-RAM data

In addition to fast user-defined functions, which @xiaodai mentioned, there are at least two other advantages to being able to write queries in Julia itself. I use Polars at work, so my examples are in Polars, but the same patterns would carry over to Julia queries.

Code reuse

Reusing query logic in SQL is awkward, but it is easy in Polars: a helper function can build and return a reusable expression. Here’s an example:

import polars as pl

def shift_over(col_name, n):
    # Build a reusable expression: shift within each
    # (serial_number, trip) window, ordered by timestamp.
    return (
        pl.col(col_name)
        .shift(n)
        .over(
            partition_by=["serial_number", "trip"],
            order_by="timestamp",
        )
    )

df2 = df.with_columns(
    a_lag_1  = shift_over("a", 1),
    a_lead_1 = shift_over("a", -1),
    b_lag_1  = shift_over("b", 1),
    b_lead_1 = shift_over("b", -1),
)

Programmatically generate columns

It’s easy to programmatically generate new columns in Polars; not so easy in SQL. Here’s an example that combines code reuse with programmatically generated columns:

import polars as pl

def forward_circular_shift(col_name, n):
    # Move the last n values to the front of the column,
    # wrapping around instead of filling with nulls.
    return (
        pl.col(col_name)
        .tail(n)
        .append(
            pl.col(col_name).head(pl.len() - n)
        )
        .alias(col_name + f"_lag_{n}")
    )

df2 = (
    df
    .with_columns(
        [
            forward_circular_shift(col_name, n)
            for col_name in ["a", "b", "c", "d", "e"]
            for n in [1, 2, 3, 4, 5]
        ]
    )
)