I want to announce a public demo release of a package I’ve been working on to test the viability of building a backend-agnostic interface for tabular data in Julia, one that would let the ecosystem move towards an architecture closer to that used by databases. In homage to the classic Volcano model in the database literature, I’ve called the package Volcanito, and you can check it out at https://github.com/johnmyleswhite/Volcanito.jl.
The package lets users write operations on data in terms of simple macros:
@select(df, a, b, d = a + b)
@where(df, a > b)
@aggregate_vector(
    @group_by(df, !c),
    m_a = mean(a),
    m_b = mean(b),
    n_a = length(a),
    n_b = length(b),
)
@order_by(df, a + b)
@limit(df, 10)
These macros are translated into logical nodes that can be applied to arbitrary data sources.
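To make the idea of "logical nodes" concrete, here is a minimal sketch in plain Julia. It is not Volcanito's code: the types `LogicalNode`, `Source`, `Where`, and `Limit` and the function `execute` are hypothetical names chosen for illustration, and the data source is assumed to be any iterable of named tuples.

```julia
# Hypothetical logical-plan nodes; illustrative only, not Volcanito's actual types.
abstract type LogicalNode end

struct Source{T} <: LogicalNode
    data::T                      # any iterable of rows (the "backend")
end

struct Where{F} <: LogicalNode
    input::LogicalNode
    predicate::F                 # row -> Bool
end

struct Limit <: LogicalNode
    input::LogicalNode
    n::Int
end

# Building a plan does no work; `execute` walks the tree and produces rows.
execute(node::Source) = collect(node.data)
execute(node::Where) = filter(node.predicate, execute(node.input))
execute(node::Limit) = collect(Iterators.take(execute(node.input), node.n))

rows = [(a = 3, b = 1), (a = 1, b = 2), (a = 5, b = 4)]

# Roughly what @limit(@where(df, a > b), 1) could describe in this toy model.
plan = Limit(Where(Source(rows), r -> r.a > r.b), 1)
result = execute(plan)
```

Because the plan is just a tree of values, a different backend would only need its own `execute` methods; the description of the query stays the same.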
Yes, I’ll get there at some point, but I’ve only had a little bit of time to work on this over my last week of vacation, so it might take a long time to get there. The goal here was mostly to get something whose architecture is mature enough out into the public to influence future thinking in the space.
I see that it does lazy evaluation, and results are only computed when show is called. This seems like it will hit bottlenecks if the datasets are large, as neither caching nor recomputing on the fly would be a good solution.
I don’t know. I’ve never made much effort to follow the Queryverse. I wrote this in part as an indication of how I would hope DataFramesMeta would evolve.
Why wouldn’t it? Every time you print, it runs through the same operations. Each operation might take 10 minutes unless it is cached. But the cache for the operation might be huge because the data is huge.
How would one compose these macros? For example, how would one compose @select and @where to select or reject certain rows which match the @where condition?
I follow what you mean. Why not, in addition to show, introduce an @compute or @result operation that calculates and caches an intermediate result to which the program could apply additional lazy operations?
If you read the docs, you see that this already exists: materialize.
It looks like this is meant to allow a declarative query that can be optimized when executed. You are always free to execute the graph and then use the result in further operations.
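The trade-off discussed above (recompute on every show versus execute once and reuse) can be sketched in a few lines. This is a toy model, not Volcanito's implementation: `LazyQuery`, `evaluate`, and `materialize_sketch` are hypothetical names, and `run_count` exists only to make the number of executions visible.

```julia
# Hypothetical lazy wrapper; illustrative only, not how Volcanito implements `materialize`.
struct LazyQuery{F}
    compute::F                   # zero-argument function that runs the whole plan
end

run_count = Ref(0)               # counts how many times the plan is actually executed
data = collect(1:10)

query = LazyQuery(() -> begin
    run_count[] += 1
    filter(x -> x > 5, data)
end)

# Every evaluation (for example, every time the result is shown) re-runs the plan.
evaluate(q::LazyQuery) = q.compute()

# Executing once and keeping the concrete result avoids the repeated work.
materialize_sketch(q::LazyQuery) = q.compute()

evaluate(query)
evaluate(query)                          # the plan has now run twice
cached = materialize_sketch(query)       # third and final run
further = filter(iseven, cached)         # operates on the cached result, no re-run
```

The cost the thread worries about is visible here: the cached result occupies memory proportional to its size, so whether to materialize is a choice left to the caller.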
Exactly what I am looking for @johnmyleswhite, thank you for the contribution. Could anyone please provide some comparison with the Query.jl package, what are the pros and cons of each approach?
Do any of them provide a row selector? I would like to slice Tables.jl tables vertically in a lazy fashion given indices for start and end rows, but couldn’t find a package to do this yet.
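For reference, one way to get a non-copying row range without any package is sketched below. It assumes the table is a named tuple of vectors (a form Tables.jl accepts as a column table) and uses only Base's `view`; `rowslice` is a hypothetical helper name, not a function from Tables.jl, Query.jl, or Volcanito.

```julia
# Hypothetical helper: a row range over a column table, built from views (no copying).
# Assumes the table is a NamedTuple of AbstractVectors.
rowslice(tbl::NamedTuple, start::Integer, stop::Integer) =
    map(col -> view(col, start:stop), tbl)

table = (id = [1, 2, 3, 4, 5], score = [0.5, 1.5, 2.5, 3.5, 4.5])

slice = rowslice(table, 2, 4)    # still a NamedTuple, columns are SubArrays

# The slice shares memory with the original columns.
table.id[2] = 20
```

Since the columns of `slice` are views, changes to the parent table show up in the slice, which is the usual price of avoiding a copy.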