I want to announce a public demo release of a package I’ve been working on to test the viability of building a backend-agnostic interface for tabular data in Julia, one that would let the ecosystem move towards an architecture closer to the one used by databases. In homage to the classic Volcano model from the database literature, I’ve called the package Volcanito. You can check it out at GitHub - johnmyleswhite/Volcanito.jl: A backend-agnostic interface for tabular data operations in Julia.
The package lets users write operations on data in terms of simple macros:
@select(df, a, b, d = a + b)
@where(df, a > b)
@aggregate_vector(
    @group_by(df, !c),
    m_a = mean(a),
    m_b = mean(b),
    n_a = length(a),
    n_b = length(b),
)
@order_by(df, a + b)
@limit(df, 10)
These macros are translated into logical nodes that can be applied to arbitrary data sources.
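To make the idea concrete, here is a minimal sketch of what "a macro lowers to a logical node" could look like. This is purely illustrative and not Volcanito’s actual internals: the node type, field names, and `execute` function are all hypothetical. The point is that something like @where(df, a > b) can expand to a plan object that captures the predicate as a function over rows, with execution deferred to a separate step.

```julia
# Hypothetical sketch (not Volcanito's real implementation): a @where-style
# macro could expand into a logical node that stores its source and a
# predicate, without running anything yet.
struct WhereNode{S,F}
    source::S      # any iterable of named-tuple rows
    predicate::F   # row -> Bool
end

# Execution is a separate step, so different backends could interpret
# the same node differently.
execute(node::WhereNode) = [row for row in node.source if node.predicate(row)]

rows = [(a = 1, b = 2), (a = 3, b = 2)]
plan = WhereNode(rows, row -> row.a > row.b)
execute(plan)  # [(a = 3, b = 2)]
```

Because the node is just data, a backend is free to rewrite or optimize the plan before executing it, which is the appeal of the Volcano-style architecture.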
Do you plan to add @join?
Yes, I’ll get there at some point, but I’ve only had a little bit of time to work on this over my last week of vacation, so it might take a while. The goal here was mostly to get something with a mature enough architecture out into the public to influence future thinking in the space.
I have been thinking about something like this.
This is nice.
How does this compare to Query.jl in Queryverse?
Query.jl seems to be based on row semantics, whereas Volcanito.jl has greater potential as a columnar manipulation library.
I see that it does lazy evaluation and only shows the results when show is hit. This seems like it will hit bottlenecks if the datasets are large, as neither caching nor recomputing on the fly would be a good solution.
Why would lazy evaluation hit bottlenecks for large datasets?
I don’t know. I’ve never made much effort to follow the Queryverse. I wrote this in part as an indication of how I would hope DataFramesMeta would evolve.
Why wouldn’t it? Every time you print, it runs through the same operations. Each operation might take 10 minutes unless cached. But caching each operation might be huge, because the data is huge.
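The recomputation concern can be sketched in a few lines of generic Julia (this is not Volcanito-specific; the names are illustrative): a lazy plan re-runs its operations on every use, while an explicitly materialized result pays the cost once.

```julia
# Sketch of the recomputation concern: count how many times an
# "expensive" operation actually runs.
calls = Ref(0)
expensive_filter(rows) = (calls[] += 1; [r for r in rows if r.a > r.b])

rows = [(a = 1, b = 2), (a = 3, b = 2)]

# Lazy: a thunk that recomputes on every use (e.g. every display).
plan() = expensive_filter(rows)
plan()
plan()
calls[]  # 2 — the work ran twice

# Materialized: compute once, then reuse the concrete result freely.
result = expensive_filter(rows)
result
result
calls[]  # 3 — only one additional run, no matter how often result is used
```

This is exactly the trade-off being discussed: recomputing costs time on every display, while caching every intermediate costs memory, so an explicit materialization step lets the user choose where to pay.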
How would one compose these macros? For example, how would one compose @select and @where to select or reject certain rows which match the @where condition?
Based on my understanding it builds a DAG of sorts and compiles that DAG to DataFrames.jl code.
I follow what you mean. Why not, in addition to show, introduce an @compute or @result operation that calculates and caches an intermediate result, to which the program could apply additional lazy operations?
They already should compose. The example depends on composition:
@aggregate_vector(
    @group_by(df, !c),
    m_a = mean(a),
    m_b = mean(b),
    n_a = length(a),
    n_b = length(b),
)
Your select and where example is equivalent; just replace df in one of the expressions with the result from another operation.
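As a hypothetical sketch of that composition (illustrative names only, not Volcanito's API): nesting works because the inner node simply becomes the source of the outer one, the same way @group_by(df, !c) feeds @aggregate_vector above.

```julia
# Illustrative logical nodes: the inner node is the outer node's source.
struct Where{S,P}
    source::S
    pred::P
end
struct Select{S,F}
    source::S
    f::F
end

# A tiny recursive evaluator: a plain vector of rows is a leaf.
evalnode(rows::Vector) = rows
evalnode(n::Where)  = [r for r in evalnode(n.source) if n.pred(r)]
evalnode(n::Select) = [n.f(r) for r in evalnode(n.source)]

rows = [(a = 1, b = 2), (a = 3, b = 2)]

# Roughly @select(@where(df, a > b), a) as nested nodes:
plan = Select(Where(rows, r -> r.a > r.b), r -> (a = r.a,))
evalnode(plan)  # [(a = 3,)]
```

Nothing runs until the evaluator walks the tree, which is why composing the macros builds up a plan (a DAG in general) rather than intermediate tables.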
If you read the docs, you see that this already exists: materialize.
It looks like this is meant to allow a declarative query that can be optimized when executed. You are always free to execute the graph and then use the result in further operations.
What’s the difference here from the materialize operation used in the second part of the README?
Yes, you’re right. I overlooked the last sentence and last few lines of the second example. @xiaodai might have missed these, too.
Nothing. They are the same idea. I just didn’t see it until @jlapeyre pointed it out to me.
Exactly what I am looking for @johnmyleswhite, thank you for the contribution. Could anyone please provide some comparison with the Query.jl package, what are the pros and cons of each approach?
Does any of them provide a row selector? I would like to slice Tables.jl tables vertically in a lazy fashion given indices for start and end rows, but couldn’t find a package to do this yet.
Fantastic idea! I was kind of hoping this would be Vulcanito.jl (Vulcans | Star Trek), but your naming rationale is much more sound.
What backends do you intend to support? DataKnots.jl?