[WIP] Announcing Volcanito.jl: a backend-agnostic interface for tabular data

Query.jl seems to be based on row semantics, whereas Volcanito.jl has greater potential as a columnar manipulation library.

I see that it does lazy evaluation, and only when show is hit does it display the results. This seems like it will hit bottlenecks if the datasets are large, as neither caching nor recomputing on the fly would be a good solution.

Why would lazy evaluation hit bottlenecks for large datasets?

I don’t know. I’ve never made much effort to follow the Queryverse. I wrote this in part as an indication of how I would hope DataFramesMeta would evolve.

Why wouldn’t it? Every time you print, it runs through the same operations, and each operation might take 10 minutes unless the results are cached. But the cache for those operations might be huge because the data is huge.

How would one compose these macros? For example, how would one compose @select and @where to select or reject certain rows which match the @where condition?

Based on my understanding it builds a DAG of sorts and compiles that DAG to DataFrames.jl code.
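
A minimal sketch of that idea (with hypothetical types and names, not Volcanito’s actual internals): each macro adds a node to a plan, and nothing touches the data until the plan is compiled into ordinary DataFrames.jl calls.

using DataFrames

# Hypothetical logical-plan nodes; Volcanito's real representation differs.
abstract type LogicalNode end

struct Source  <: LogicalNode; table::DataFrame;                end
struct Where   <: LogicalNode; input::LogicalNode; pred;        end
struct GroupBy <: LogicalNode; input::LogicalNode; key::Symbol; end

# "Compiling" the DAG here just means walking it and emitting eager
# DataFrames.jl calls; a real backend could optimize the plan first.
execute(node::Source)  = node.table
execute(node::Where)   = filter(node.pred, execute(node.input))
execute(node::GroupBy) = groupby(execute(node.input), node.key)

plan = GroupBy(Where(Source(DataFrame(x = 1:10, g = repeat(1:2, 5))),
                     row -> row.x < 6),
               :g)
execute(plan)  # only here does any real work happen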


I follow what you mean. Why not, in addition to show, introduce an @compute or @result operation that calculates and caches an intermediate result, to which the program could then apply additional lazy operations?

They already should compose. The example depends on composition:


@aggregate_vector(
    @group_by(df, !c),
    m_a = mean(a),
    m_b = mean(b),
    n_a = length(a),
    n_b = length(b),
)

Your @select and @where example is equivalent; just replace df in one of the expressions with the result of another operation.
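
For example, filtering and then selecting might look like this (I’m assuming @select, like the other macros, takes the data source as its first argument followed by the columns you want):

@select(
    @where(df, a > 0),
    a,
    b,
)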

If you read the docs, you see that this already exists: materialize.
It looks like this is meant to allow a declarative query that can be optimized when executed. You are always free to execute the graph and then use the result in further operations.
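
Something along these lines should work (I’m assuming materialize takes a plan and returns a concrete DataFrame; check the README for the exact spelling and signature):

# Run the expensive part once and keep the concrete result around...
cached = materialize(@where(df, y < 10))

# ...then apply further lazy operations to the cached table.
@aggregate_vector(@group_by(cached, g), m = mean(x))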


What’s the difference here from the materialize operation used in the second part of the README?

Yes, you’re right. I overlooked the last sentence and last few lines of the second example. @xiaodai might have missed these, too. :wink:

Nothing. They are the same idea. I just didn’t see it until @jlapeyre pointed it out to me. :slight_smile:


Exactly what I am looking for, @johnmyleswhite, thank you for the contribution. Could anyone please provide some comparison with the Query.jl package? What are the pros and cons of each approach?

Do any of them provide a row selector? I would like to slice Tables.jl tables row-wise in a lazy fashion, given start and end row indices, but I couldn’t find a package that does this yet.
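
As a rough sketch of what I mean, using only Base iterators and Tables.jl rather than any of the packages above:

using DataFrames, Tables

tbl = DataFrame(x = 1:100, y = rand(100))

# Rows 11 through 20 as a lazy iterator of named tuples; nothing is copied yet.
slice = Iterators.take(Iterators.drop(Tables.namedtupleiterator(tbl), 10), 10)

# Materialize only when the rows are actually needed.
DataFrame(collect(slice))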


Fantastic idea! I was kind of hoping this would be Vulcanito.jl (https://www.startrek.com/database_article/vulcans), but your reasoning for the name is much more sound.

What backends do you intend to support? DataKnots.jl?


My plan is to clean things up to the point that backends could implement physical operations over logical plans for themselves so that Volcanito.jl could focus only on the query language layer and logical plan optimizations. Even that will take me quite some time.


Nice, so there are a few options:

  • TableOperations.jl
  • Query.jl
  • Volcanito.jl
  • DataKnots.jl

Can you please clarify the differences? I am asking as a potential user who has little background in database languages and queries, so if someone could list the advantages and disadvantages of each approach, that would be super helpful. Also, an answer to the earlier question about selecting ranges of rows would be helpful. Is that achievable in any of these efforts?


Can you provide a more explicit example of how to compose the query macros? For example, how would I implement the following SQL query?

SELECT AVG(x), g
FROM tbl
WHERE y < 10
GROUP BY g

(Hopefully that’s correct SQL syntax. I don’t actually use SQL much…)

Edited to reflect @derekmahar’s comment below.

The query syntax is correct, aside from the pedantic detail that TABLE is a keyword in SQL, so if the table were actually named “table”, you’d have to surround its name in backticks, like `table`.


This is a good question and it touches on several subtle things that make creating this kind of API tricky. Here’s what I’d do to make the composition maximally explicit:

@aggregate_vector(
    @group_by(
        @where(tbl, y < 10),
        g,
    ),
    m = mean(x)
)

To see this in action:

julia> tbl = DataFrame(
           x = rand(1:100, 10),
           y = rand(1:20, 10),
           g = rand(1:2, 10),
       );

julia> @aggregate_vector(
           @group_by(
               @where(tbl, y < 10),
               g,
           ),
           m = mean(x)
       )
2×2 DataFrame
│ Row │ g     │ m       │
│     │ Int64 │ Float64 │
├─────┼───────┼─────────┤
│ 1   │ 1     │ 79.0    │
│ 2   │ 2     │ 72.0    │

What makes this tricky:

  • Like many other systems (e.g. dplyr), Volcanito encourages operations to be written in the order in which they need to be executed, which is not the SQL order. Many people eventually find they prefer this logical ordering after getting used to it, but it’s confusing at first.
  • SQL’s SELECT behaves quite differently on grouped data than on ungrouped data, and mimicking that faithfully would require something macros can’t offer: the ability to recognize AVG as an aggregation function rather than a scalar function at macro expansion time. That information simply isn’t available then, because macros see only names, not the functions those names refer to. Instead, in Volcanito you don’t use @select on grouped data at all – you use @aggregate_vector directly. It captures the groups automatically, so you only express the aggregations you want to perform.
  • Working with columns whose names are not valid identifiers, like a column called AVG(x), is tricky (see the sketch after this list). My recommendation is to avoid this whenever possible and name your aggregations; in my example above, I named this aggregate m.
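
To make that concrete, here is what a SQL-style column name forces on you in plain DataFrames.jl:

df = DataFrame("AVG(x)" => [79.0, 72.0], "g" => [1, 2])

df[!, "AVG(x)"]    # string indexing still works...
df.var"AVG(x)"     # ...but property access needs the var"..." escape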

With all that said, I would personally probably write this using Pipe.jl or a similar package that makes the deep nesting easier to read without requiring named intermediate variables:

julia> import Pipe: @pipe

julia> @pipe tbl |>
           @where(_, y < 10) |>
           @group_by(_, g) |>
           @aggregate_vector(_, m = mean(x))
2×2 DataFrame
│ Row │ g     │ m       │
│     │ Int64 │ Float64 │
├─────┼───────┼─────────┤
│ 1   │ 1     │ 79.0    │
│ 2   │ 2     │ 72.0    │