I want to announce a public demo release of a package I’ve been working on to test the viability of building a backend-agnostic interface for tabular data in Julia, one that would let the ecosystem move towards an architecture closer to the one used by databases. In homage to the classic Volcano model from the database literature, I’ve called the package Volcanito. You can check it out at GitHub - johnmyleswhite/Volcanito.jl: A backend-agnostic interface for tabular data operations in Julia.
The package lets users write operations on data in terms of simple macros:
@select(df, a, b, d = a + b)
@where(df, a > b)
@aggregate_vector(
    @group_by(df, !c),
    m_a = mean(a),
    m_b = mean(b),
    n_a = length(a),
    n_b = length(b),
)
@order_by(df, a + b)
@limit(df, 10)
These macros are translated into logical nodes that can be applied to arbitrary data sources.
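To make the idea concrete, here is a minimal sketch of what "a macro lowers to a logical node" could look like. This is purely illustrative and not Volcanito’s actual internals: the node type, field names, and `execute` function are all hypothetical. The point is that something like @where(df, a > b) can expand to a plan object that captures the predicate as a function over rows, with execution deferred to a separate step.

```julia
# Hypothetical sketch (not Volcanito's real implementation): a @where-style
# macro could expand into a logical node that stores its source and a
# predicate, without running anything yet.
struct WhereNode{S,F}
    source::S      # any iterable of named-tuple rows
    predicate::F   # row -> Bool
end

# Execution is a separate step, so different backends could interpret
# the same node differently.
execute(node::WhereNode) = [row for row in node.source if node.predicate(row)]

rows = [(a = 1, b = 2), (a = 3, b = 2)]
plan = WhereNode(rows, row -> row.a > row.b)
execute(plan)  # [(a = 3, b = 2)]
```

Because the node is just data, a backend is free to rewrite or optimize the plan before executing it, which is the appeal of the Volcano-style architecture.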
Do you plan to add @join?
Yes, I’ll get there at some point, but I’ve only had a little bit of time to work on this over my last week of vacation, so it might take a while. The goal here was mostly to get something with a mature enough architecture out into the public to influence future thinking in the space.
I have been thinking about something like this.
This is nice.
How does this compare to Query.jl in Queryverse?
Query.jl seems to be based on row semantics, whereas Volcanito.jl has greater potential as a columnar manipulation library.
I see that it does lazy evaluation and only shows the results when show is hit. This seems like it will hit bottlenecks if the datasets are large, as neither caching nor recomputing on the fly would be a good solution.
Why would lazy evaluation hit bottlenecks for large datasets?
I don’t know. I’ve never made much effort to follow the Queryverse. I wrote this in part as an indication of how I would hope DataFramesMeta would evolve.
Why wouldn’t it? Every time you print, it runs through the same operations. Each operation might take 10 minutes unless cached. But caching each operation might be huge, because the data is huge.
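The recomputation concern can be sketched in a few lines of generic Julia (this is not Volcanito-specific; the names are illustrative): a lazy plan re-runs its operations on every use, while an explicitly materialized result pays the cost once.

```julia
# Sketch of the recomputation concern: count how many times an
# "expensive" operation actually runs.
calls = Ref(0)
expensive_filter(rows) = (calls[] += 1; [r for r in rows if r.a > r.b])

rows = [(a = 1, b = 2), (a = 3, b = 2)]

# Lazy: a thunk that recomputes on every use (e.g. every display).
plan() = expensive_filter(rows)
plan()
plan()
calls[]  # 2 — the work ran twice

# Materialized: compute once, then reuse the concrete result freely.
result = expensive_filter(rows)
result
result
calls[]  # 3 — only one additional run, no matter how often result is used
```

This is exactly the trade-off being discussed: recomputing costs time on every display, while caching every intermediate costs memory, so an explicit materialization step lets the user choose where to pay.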
How would one compose these macros? For example, how would one compose @select and @where to select or reject certain rows which match the @where condition?
Based on my understanding it builds a DAG of sorts and compiles that DAG to DataFrames.jl code.
I follow what you mean. Why not, in addition to show, introduce an @compute or @result operation that calculates and caches an intermediate result, to which the program could apply additional lazy operations?
They already should compose. The example depends on composition:
@aggregate_vector(
    @group_by(df, !c),
    m_a = mean(a),
    m_b = mean(b),
    n_a = length(a),
    n_b = length(b),
)
Your select and where example is equivalent; just replace df in one of the expressions with the result from another operation.
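As a hypothetical sketch of that composition (illustrative names only, not Volcanito's API): nesting works because the inner node simply becomes the source of the outer one, the same way @group_by(df, !c) feeds @aggregate_vector above.

```julia
# Illustrative logical nodes: the inner node is the outer node's source.
struct Where{S,P}
    source::S
    pred::P
end
struct Select{S,F}
    source::S
    f::F
end

# A tiny recursive evaluator: a plain vector of rows is a leaf.
evalnode(rows::Vector) = rows
evalnode(n::Where)  = [r for r in evalnode(n.source) if n.pred(r)]
evalnode(n::Select) = [n.f(r) for r in evalnode(n.source)]

rows = [(a = 1, b = 2), (a = 3, b = 2)]

# Roughly @select(@where(df, a > b), a) as nested nodes:
plan = Select(Where(rows, r -> r.a > r.b), r -> (a = r.a,))
evalnode(plan)  # [(a = 3,)]
```

Nothing runs until the evaluator walks the tree, which is why composing the macros builds up a plan (a DAG in general) rather than intermediate tables.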
If you read the docs, you see that this already exists: materialize.
It looks like this is meant to allow a declarative query that can be optimized when executed. You are always free to execute the graph and then use the result in further operations.
What’s the difference here from the materialize operation used in the second part of the README?
Yes, you’re right. I overlooked the last sentence and last few lines of the second example. @xiaodai might have missed these, too.
Nothing. They are the same idea. I just didn’t see it until @jlapeyre pointed it out to me.
Exactly what I am looking for @johnmyleswhite, thank you for the contribution. Could anyone please provide some comparison with the Query.jl package, what are the pros and cons of each approach?
Does any of them provide a row selector? I would like to slice Tables.jl tables vertically in a lazy fashion given indices for start and end rows, but couldn’t find a package to do this yet.
Fantastic idea! I was kind of hoping this would be Vulcanito.jl (Vulcans | Star Trek), but your naming rationale is much more sound.
What backends do you intend to support? DataKnots.jl?