My plan is to clean things up to the point that backends could implement physical operations over logical plans for themselves so that Volcanito.jl could focus only on the query language layer and logical plan optimizations. Even that will take me quite some time.
Nice! A couple of options:
- TableOperations.jl
- Query.jl
- Volcanito.jl
- DataKnots.jl
Can you please clarify the differences? I am asking as a potential user who has little background in database languages and queries, so a list of the advantages and disadvantages of each approach would be super helpful. An answer to the previous question on vertical selections of rows would also help: is that achievable in any of these efforts?
Can you provide a more explicit example of how to compose the query macros? For example, how would I implement the following SQL query?
SELECT AVG(x), g
FROM tbl
WHERE y < 10
GROUP BY g
(Hopefully that’s correct SQL syntax. I don’t actually use SQL much…)
Edited to reflect @derekmahar’s comment below.
Query syntax is correct, aside from the pedantic detail that `TABLE` is a keyword in SQL, so assuming a table named "table" exists, you'd have to surround its name in backticks like `table`.
This is a good question and it touches on several subtle things that make creating this kind of API tricky. Here’s what I’d do to make the composition maximally explicit:
@aggregate_vector(
@group_by(
@where(tbl, y < 10),
g,
),
m = mean(x)
)
To see this in action:
julia> using Volcanito, DataFrames, Statistics

julia> tbl = DataFrame(
x = rand(1:100, 10),
y = rand(1:20, 10),
g = rand(1:2, 10),
);
julia> @aggregate_vector(
@group_by(
@where(tbl, y < 10),
g,
),
m = mean(x)
)
2×2 DataFrame
│ Row │ g │ m │
│ │ Int64 │ Float64 │
├─────┼───────┼─────────┤
│ 1 │ 1 │ 79.0 │
│ 2 │ 2 │ 72.0 │
What makes this tricky:
- Like many other systems (e.g. dplyr), Volcanito encourages operations to be written in the order in which they need to be executed, which is not the SQL order. Many people eventually find they prefer this logical ordering after getting used to it, but it’s confusing at first.
- SQL's `SELECT` behaves quite differently on grouped data than on ungrouped data, and doing so faithfully would require something macros can't offer: the ability to distinguish `AVG` as an aggregation function rather than a scalar function at macro compilation time. This isn't possible for many reasons, the most prominent being that such type-like information about names isn't available at macro compilation time. Instead, in Volcanito, you don't use `@select` at all on grouped data; you use `@aggregate_vector` directly. It captures the groups automatically, so you only express the aggregations you want to perform.
- Working with columns whose names are not valid identifiers, like the column called `AVG(x)`, is tricky. My recommendation is to avoid this whenever possible and name your aggregations. In my example above, I named this aggregate `m`.
With all that said, I would personally write this using Pipe.jl or a similar package that makes the deep nesting easier to read without requiring named intermediate variables:
julia> import Pipe: @pipe
julia> @pipe tbl |>
@where(_, y < 10) |>
@group_by(_, g) |>
@aggregate_vector(_, m = mean(x))
2×2 DataFrame
│ Row │ g │ m │
│ │ Int64 │ Float64 │
├─────┼───────┼─────────┤
│ 1 │ 1 │ 79.0 │
│ 2 │ 2 │ 72.0 │
I added a rough prototype of inner, left outer, right outer and full outer joins. It’s not well-tested, but hopefully gives some sense of what the API could look like (and how it could gradually evolve to handle joins that aren’t equi-joins).
Interesting, thanks!
This is actually the opposite of the impression I got from reading the README. I thought Volcanito.jl was leaning more towards SQL in the sense that it provides a declarative language for a query, and then under the hood there’s a query optimizer that decides the best way to run the query.
So, in a procedural framework like dplyr or DataFrames.jl, an inexperienced programmer might transform before filtering, like this:
df = DataFrame(a=1:4)
transform!(df, :a => ByRow(sqrt) => :b)
filter!(row -> row.a < 3, df)
But of course it’s probably more efficient to filter before transforming. In a purely declarative query syntax, that sort of rookie mistake shouldn’t actually be possible, since query optimization is handled automatically under the hood.
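To make the contrast concrete, here's the hand-optimized ordering of the same computation, using the same DataFrames.jl calls in the opposite order:

using DataFrames

df = DataFrame(a = 1:4)
filter!(row -> row.a < 3, df)              # filter first: the table shrinks to 2 rows
transform!(df, :a => ByRow(sqrt) => :b)    # then sqrt runs on 2 rows instead of 4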
This is a good start! Will joins in Volcanito likely always or mostly rely on the corresponding implementation in the underlying table? This would imply, for example, that before Volcanito can implement non-equi-joins in DataFrames, DataFrames.jl must implement non-equi-joins.
By the way, how does the type `FunctionSpec` work? This type, which you use in the first parameter of the method `_predicate_to_pair`, caught my attention because I think it's relevant to a discussion that I started recently about the limitations of the type `Function` in Julia function parameters.
My understanding is that there is some potential effort to implement joins independently of the table package. See https://github.com/JuliaData/DataFrames.jl/issues/2340
This is good feedback and I’ll clarify the README to address your questions.
This is actually the opposite of the impression I got from reading the README. I thought Volcanito.jl was leaning more towards SQL in the sense that it provides a declarative language for a query, and then under the hood there’s a query optimizer that decides the best way to run the query.
I'd say this is true, but I think these terms are abstract enough that there's potential for misunderstanding. The distinction that matters isn't declarative versus procedural per se (arguably `DataFrames.innerjoin` is already declarative, since it doesn't dictate the specific algorithm used, like a hash join), but rather laziness versus eagerness. Perhaps a useful analogy, if you're familiar with it, is the problem of optimizing the multiplication order in a long chain of matrix multiplications. In a lazy multiplication system, you can reorder the chain once you know the sizes of the inputs to improve performance. If you multiply eagerly, you just end up with the ordering determined by the language's parsing strategy and its function argument evaluation order.
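To make the analogy concrete, here's a small illustration in plain Julia (nothing Volcanito-specific): both orderings compute the same 1000×2 product, but the intermediates differ enormously in size.

julia> A = rand(1000, 2); B = rand(2, 1000); C = rand(1000, 2);

julia> size(A * B)   # the eager (A*B)*C order materializes a 1000×1000 intermediate
(1000, 1000)

julia> size(B * C)   # a lazy system could pick A*(B*C); its intermediate is only 2×2
(2, 2)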
tl;dr: It’s laziness that gives you more opportunities for optimization, but it’s also laziness that can confuse users.
This is a good start! Will joins in Volcanito likely always or mostly rely on the corresponding implementation in the underlying table? This would imply, for example, that before Volcanito can implement non-equi-joins in DataFrames, DataFrames.jl must implement non-equi-joins.
Volcanito could implement more complex joins by generating a Cartesian product and then applying a selection, but that's likely to be quite slow. I think it's unlikely I'll have time to write an optimized non-equi-join implementation.
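For concreteness, here's a sketch of that fallback strategy using plain DataFrames.jl calls (the tables and the range-join predicate are made up for illustration):

using DataFrames

# Naive non-equi ("range") join: materialize the Cartesian product, then
# apply the join predicate as a filter. This is O(n*m) in time and memory.
events  = DataFrame(id = 1:2, t = [3, 8])
windows = DataFrame(w = 1:2, lo = [1, 5], hi = [4, 9])

product = crossjoin(events, windows)
result  = filter(row -> row.lo <= row.t <= row.hi, product)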
By the way, how does type `FunctionSpec` work? This type, which you use in the first parameter of method `_predicate_to_pair`, caught my attention because I think it's relevant to a discussion that I started recently about the limitations of type `Function` in Julia function parameters.
I think viewing this as a function is probably misleading – I'm even likely to rename it to `ExpressionSpec` at some point to avoid that confusion. `FunctionSpec` is meant to be a "portable" representation of an input expression, one that captures the expression in multiple forms that could be relevant for different systems. For systems that process streams of tuples, it contains the expression translated into a function over tuples; for systems that process raw `Expr` objects, it also contains the original `Expr`.
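A minimal sketch of the idea (the struct and field names here are my guesses for illustration, not Volcanito's actual definition):

# Sketch only: capture both the raw Expr and a compiled function over rows,
# so different backends can pick whichever form they can work with.
struct PortableExpression
    source::Expr              # for systems that rewrite/inspect syntax
    tuple_function::Function  # for systems that stream tuples
end

# For the predicate `y < 10`, a macro could capture both forms:
spec = PortableExpression(:(y < 10), row -> row.y < 10)

spec.source                          # :(y < 10)
spec.tuple_function((x = 1, y = 3))  # true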
A minor point: is there a strong theoretical reason for the name `@aggregate_vector`? Otherwise I would prefer that it just be called `@aggregate`.
Does your macro query syntax support window functions? For instance, in dplyr I can do the following:
> library(dplyr)
> df <- data.frame(a=rep(c(1, 2), each=4), b=1:8); df
a b
1 1 1
2 1 2
3 1 3
4 1 4
5 2 5
6 2 6
7 2 7
8 2 8
> df %>%
+ group_by(a) %>%
+ filter(b > mean(b))
# A tibble: 4 x 2
# Groups: a [2]
a b
<dbl> <int>
1 1 3
2 1 4
3 2 7
4 2 8
So the filtering happens on a per-group level and can return more than one row per group.
Honestly, I don’t know how to do that with DataFrames.jl anymore. I can’t keep up with all the changes to DataFrames.
Is there a strong theoretical reason for the name `@aggregate_vector`? Otherwise I would prefer that it just be called `@aggregate`.
Yes, there's a good theoretical reason: traditional databases only support aggregation functions that consume one tuple at a time (rather than whole columns), so things like `quantile(col1, 0.9)` don't work. Traditional DBs make heavy use of that restriction to avoid allocating so much memory and to enable better parallelism. I want to do the same, so there's going to be an `@aggregate_scalar` in the near-ish future. At some point, I'll probably decide that one of the two deserves the name `@aggregate`, but I don't want to commit now, and I think it's useful to encourage users to appreciate the difference in the interim.
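To illustrate the distinction (a sketch, not Volcanito's API): an aggregate like `mean` can be computed with a constant-memory accumulator over a stream of rows, while `quantile` needs the whole column materialized.

using Statistics

# Tuple-at-a-time: a constant-memory accumulator, easy to parallelize.
function streaming_mean(rows)
    total, n = 0.0, 0
    for row in rows
        total += row.x
        n += 1
    end
    return total / n
end

rows = ((x = v,) for v in 1:100)    # a lazy stream of rows
streaming_mean(rows)                # 50.5, without materializing a column

# Whole-column: quantile must see (and sort) every value at once.
quantile([v for v in 1:100], 0.9)   # 90.1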
I don’t have an immediate plan for dplyr style window functions like:
> df %>%
+ group_by(a) %>%
+ filter(b > mean(b))
The core issue is that deciding that `b > mean(b)` implies an implicit window function depends either on some purely syntactic rules or, more likely, on run-time type information, which Volcanito doesn't have at macro compilation time. I also think it depends on some strong assumptions that would be bad for Julia to make. In particular, what does `a > mean(b)` indicate in the following case:
julia> df = DataFrame(
a = [2, 2, 2],
b = [[1, 2], [3, 4], [5, 6]],
)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Array… │
├─────┼───────┼────────┤
│ 1 │ 2 │ [1, 2] │
│ 2 │ 2 │ [3, 4] │
│ 3 │ 2 │ [5, 6] │
julia> @where(df, a > mean(b))
1×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Array… │
├─────┼───────┼────────┤
│ 1 │ 2 │ [1, 2] │
AFAICT dplyr doesn’t worry about that because the R community doesn’t usually work with tables that contain arrays as “scalars” that occur in each row, but other DB communities use nested data like that a lot. One goal I have in mind is that Volcanito should be trivially applicable to Apache Arrow data, which is heavily focused on supporting this kind of nested data since it’s ubiquitous in industry today.
I definitely want to support some sugar for this use case, but I think it’s going to be tricky to do well and would rather not do it badly now.
Honestly, I don’t know how to do that with DataFrames.jl anymore. I can’t keep up with all the changes to DataFrames.
The API is definitely more complex these days, but the performance is much improved since the times when I was the maintainer. I’m hopeful the performance can continue improving even if a simpler and more uniform API is layered over it.
I don't quite see the difficulty in an expression like `@where(df, a > mean(b))`. Given the row-wise semantics of `Volcanito.@where`, I read that as "calculate the mean, `μ`, of column `b`, and then filter out rows where `a > μ` is not true." For the example data frame you gave,
julia> df = DataFrame(
a = [2, 2, 2],
b = [[1, 2], [3, 4], [5, 6]],
)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Array… │
├─────┼───────┼────────┤
│ 1 │ 2 │ [1, 2] │
│ 2 │ 2 │ [3, 4] │
│ 3 │ 2 │ [5, 6] │
`@where(df, a > mean(b))` should throw an error, the same way this throws an error:
julia> using Statistics

julia> b = [[1, 2], [3, 4], [5, 6]];
julia> a = 2;
julia> μ = mean(b)
2-element Array{Float64,1}:
3.0
4.0
julia> a > μ
ERROR: MethodError: no method matching isless(::Array{Float64,1}, ::Int64)
Closest candidates are:
isless(::Missing, ::Any) at missing.jl:87
isless(::AbstractFloat, ::Real) at operators.jl:158
isless(::Real, ::Real) at operators.jl:346
...
Stacktrace:
[1] <(::Array{Float64,1}, ::Int64) at ./operators.jl:268
[2] >(::Int64, ::Array{Float64,1}) at ./operators.jl:294
[3] top-level scope at REPL[42]:1
However, the row-wise semantics of `Volcanito.@where` are pretty convenient, because they allow an expression like `@where(df, all(a .> mean(b)))` to be used when the elements of the table are not scalars. In other words, the following ought to work:
julia> df = DataFrame(a=[[3, 4], [5, 6], [7, 8]], b=[[1, 2], [3, 4], [5, 6]])
3×2 DataFrame
│ Row │ a │ b │
│ │ Array… │ Array… │
├─────┼────────┼────────┤
│ 1 │ [3, 4] │ [1, 2] │
│ 2 │ [5, 6] │ [3, 4] │
│ 3 │ [7, 8] │ [5, 6] │
julia> @where(df, all(a .> mean(b)))
2×2 DataFrame
│ Row │ a │ b │
│ │ Array… │ Array… │
├─────┼────────┼────────┤
│ 1 │ [5, 6] │ [3, 4] │
│ 2 │ [7, 8] │ [5, 6] │
Here’s another window function use case to consider:
df = DataFrame(a=[1, 1, 1, 2, 2, 2], b = [6, 4, 5, 3, 1, 2])
@select(@group_by(df, a), a, b, c=sortperm(b))
(Actually, the `@group_by` is somewhat irrelevant to what I'm about to say.) This currently doesn't work because `@select` uses row-wise semantics. I think it might make more sense for `@select` to use column-wise semantics (and for `@where` to continue using row-wise semantics). Thus the `@select` example in the README would change to this:
@select(df, a, b, d = a .+ b)
`@where(df, a > mean(b))` should throw an error, the same way this throws an error:
This seems wrong to me. Your example should likely be this AFAICT, which does work:
julia> using Statistics

julia> b = [1, 2];
julia> a = 2;
julia> μ = mean(b)
1.5
julia> a > μ
true
I think it might make more sense for `@select` to use column-wise semantics
I'm not going to do that because it immediately rules out streaming computations that never materialize a whole column that can be acted upon atomically. Even columnar databases still encourage using SQL's "one expression evaluation per row" model, because that model allows better parallelism. I could imagine a `@select_vector` if that pattern is sufficiently useful, but I think it's a bad strategy if you value scalability or portability to other data formats.
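A small illustration of why row-wise semantics compose with streaming, using plain Julia iterators rather than Volcanito internals:

# A row-wise select can stay lazy: each output row depends only on one input
# row, so no column is ever materialized.
rows = ((x = i, y = 2i) for i in 1:10_000_000)     # lazy stream of rows
selected = ((d = row.x + row.y,) for row in rows)  # still lazy, O(1) memory
first(selected)                                    # (d = 3,)

# A column-wise expression like `x .- mean(x)` can't do this: it needs the
# entire column in memory before it can produce a single output row.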
Soon someone will write a `sql_str` macro and someone can then write

sql"""
select * from abc join pdq where…
"""

and it'll get compiled to Volcanito macros, which then get optimized by a query planner and executed.
I was hoping for exactly that
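For anyone curious about the mechanics, a toy sketch of such a string macro (this `@sql_str` and its behavior are purely hypothetical):

# Toy sketch: a real version would parse the SQL and emit Volcanito macro
# calls; this one only shows how sql"..." reaches the macro as a string.
macro sql_str(query)
    return :(println("would plan and execute: ", $query))
end

sql"""
select * from abc join pdq on abc.id = pdq.id
"""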
Ohhh, right. Yeah I goofed that up. I still can’t shake my dplyr habits.
This would definitely have its uses. Here's another example with dplyr where `@select_vector` would be needed:
> df <- data.frame(x = c(1, 5, 6))
> mutate(df, deviation = x - mean(x))
x deviation
1 1 -3
2 5 1
3 6 2
If one doesn't have `@select_vector`, the workaround is to aggregate your table, join the aggregated table back to the original table, and then perform a `@select` operation, but that seems a little tedious.
EDIT: After further thought, it seems that the only way you could join the aggregated table back to the original table (at least for the example above, where there are no group ids) is to have a `crossjoin` like the `crossjoin` in DataFrames.jl.
EDIT #2: I suppose the simplest way to do the above in the Volcanito framework would be something like this:
df = DataFrame(x=[1, 5, 6])
μ = mean(df.x)
@select(df, x, deviation = x - $μ)
@johnmyleswhite This is really cool! Thanks for writing this package and thanks for documenting everything so well. I have a few thoughts on the package.
Implementation:
As far as I understand, this package works similarly to DataFramesMeta. Given an expression like

z = x + y

Volcanito makes three passes over the expression: one to replace locals, one to lift missing values, and one to apply three-valued logic, e.g. `@where(df, a > 1, b < 2)`.
It saves the result of this transformation, along with the original syntax, without evaluating it directly. This is all the stuff in the `FunctionSpec` struct. Then, we use this `FunctionSpec` object in `gen_column`, `new_column`, etc., which map the function along tuples of columns:

function gen_column(src, f, column_names)
    map(f, tuple_iterator(src, column_names))
end
All the information in the above code fence is stored in `FunctionSpec`.
The key part of Volcanito is that each `FunctionSpec` is stored in a set of `LogicalNode`s that can be re-ordered as needed for best performance.
You have a lot more experience than me, so I assume that the hardest part about this package is the “re-order logical nodes to optimize” part. My impression is that this is a “solved problem” since billions of dollars have been spent doing this for SQL.
Despite this working only with DataFrames currently, it's easy to see how it could become source-agnostic and work with Tables at large. You basically need to re-write `materialize(op::Projection)`, `columns_tuple`, and `gen_columns` to use Tables.jl and TableOperations.jl functions. This should be easy. You won't even need packages to define a core set of functions to get things to work.
Future of DataFramesMeta:
My goals for the future of DataFramesMeta are actually quite different from this. To me, the macros `@select`, `@transform`, and `@by` will be thin wrappers around `DataFrames.select`, `DataFrames.transform`, and `DataFrames.combine`, respectively. The first steps at implementing this are in a PR to DataFramesMeta here. The implementation is pretty straightforward: given an expression `z = x + y`, construct `[:x, :y] => fun => :z`, which is in the DataFrames "mini-language".
This is kind of like your `FunctionSpec` struct. The pair of pairs contains the source columns, the function, and then the result.
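For instance, with the existing DataFrames.jl API:

using DataFrames

df = DataFrame(x = 1:3, y = 4:6)

# source columns => function => result column
transform(df, [:x, :y] => ((x, y) -> x .+ y) => :z)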
On the other hand, things are greedy, not lazy. I expect most of the optimizations to happen in DataFrames.jl, e.g. multi-threading. I'd be curious to hear your thoughts on how we can allow for Volcanito-style function re-ordering as we work to modernize DataFramesMeta.
A few syntactic comments:
- Using `x` instead of `:x` in expressions. Downside: something like this fails,

df = DataFrame(a = ["0.3"])
@select(df, b = tryparse(Float64, a))

saying `:Float64` is not in the data frame. But I appreciate that all expressions starting with `:call` don't have to be escaped.
- Operating row-wise. It's often nice not to have to broadcast all functions.

df = DataFrame(x = [1, 2], y = [3, 4])
@select(df, y = x * y)
- Filtering out missing values in `@where`:

df = DataFrame(a = [missing, 1])
@where(df, a > 0.5)

There is a lengthy discussion in DataFrames.jl right now about how we should skip missing values to make things easiest for the user here. tl;dr: everyone agrees on this for filtering, but deciding consistent behavior for `@select` is really hard when working with full columns.
- I see you have plans to lift missing values here. It seems like you can use `passmissing` from Missings.jl for this? No need to re-invent a `lift` function.
- Syntax for locals via `$`s. Because DataFramesMeta forces users to write `:x` when referring to the column `df.x`, DataFramesMeta doesn't have to worry about local variables yet. But if we decide to use `x` instead of `:x`, this way of escaping seems like the obvious solution.
Thanks for all the comments! Glad you found the package interesting, and I hope there are parts you can steal as you finish reworking DataFramesMeta.
Responses to specific points you raised below.
The key part of Volcanito is that each FunctionSpec is stored in a set of LogicalNodes that can be re-ordered as needed for best performance.
Based just on lines of code in the current demo, the main functionality is in the creation of `FunctionSpec`: the insight there is that you benefit a lot from capturing both the source expression and an anonymous function that evaluates that expression. Having the source makes it possible to do things like detect equi-joins when planning the algorithm to use for computing joins, or realize that a `@where` expression is operating on a single column that might already have an index defined.
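As a hedged sketch of what having the source expression enables (the function and rules here are illustrative, not Volcanito's actual planner):

# With the raw Expr in hand, a planner can recognize an equi-join predicate
# syntactically and choose a hash join instead of a generic nested loop.
is_equijoin_predicate(ex) =
    ex isa Expr && ex.head == :call && ex.args[1] == :(==) &&
    all(arg -> arg isa Symbol, ex.args[2:3])

is_equijoin_predicate(:(left_id == right_id))  # true: a hash join applies
is_equijoin_predicate(:(a < b + 1))            # false: needs a general join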
I also think there’s a good amount of value in doing things like automating three-valued logic and lifting. I might be wrong about lifting (especially since it may tax the compiler too much to insert so many compile-time deletable branches), but my experience is that people often develop, test and debug expressions that don’t use any missing values and then expect they’ll just work given missing values. Making that easier was one of my main goals and it let me reuse the TVL example from my macros tutorial.
You have a lot more experience than me, so I assume that the hardest part about this package is the “re-order logical nodes to optimize” part. My impression is that this is a “solved problem” since billions of dollars have been spent doing this for SQL.
First off, I should note that I'm very much not a DB optimization expert. That said, although physical plan enumeration is certainly a well-studied problem, it's not really solved: basic problems like join reordering are not perfectly solvable in general, and the publicly described techniques in the field are often built on heuristic solutions. The seminal paper is https://www2.cs.duke.edu/courses/compsci516/cps216/spring03/papers/selinger-etal-1979.pdf.
On the other hand, things are greedy, not lazy. I expect most of the optimizations to happen at DataFrames.jl, i.e. multi-threading. I’d be curious to hear your thoughts on how we can allow for Volcanito-style function re-ordering as we work to modernize DataFramesMeta.
I think this is a tricky thing to do if you’re working eagerly. There’s lots of good reasons to prefer eagerness for usability, but it’s very hard to get optimization opportunities back if you can’t do any reorderings. And, unlike a system like PyTorch, I don’t think you’ll get a chance to run the “hot path” once to figure out the full query logical plan before you have to optimize – queries often run a single time.
Now on to responses to specific examples of code.
df = DataFrame(a = ["0.3"])
@select(df, b = tryparse(Float64, a))
I think this is an interesting example. To me, this failing feels very natural. I would tend to think that you’d want both of the following to work in the same way:
julia> df = DataFrame(a = ["0.3"])
1×1 DataFrame
│ Row │ a │
│ │ String │
├─────┼────────┤
│ 1 │ 0.3 │
julia> @select(df, b = tryparse($Float64, a))
1×1 DataFrame
│ Row │ b │
│ │ Float64 │
├─────┼─────────┤
│ 1 │ 0.3 │
julia> let ftype = Float64
@select(df, b = tryparse($ftype, a))
end
1×1 DataFrame
│ Row │ b │
│ │ Float64 │
├─────┼─────────┤
│ 1 │ 0.3 │
I could be more clever and extend the syntactic rules to assume that column names are never capitalized like type names, but to me this is a normal example of wanting to capture a scalar value that’s not a function. That you happen to be capturing a type rather than an arbitrary local variable seems less important to me – what matters is that you’re trying to pull in a local variable that can’t be trivially identified as such based on simple syntactic rules.
I’d extend your broadcasting example further, which I think emphasizes more sharply why I think row-wise operation is the right default:
julia> df = DataFrame(x = [[1, 2], [3, 4]], y = [[3, 4], [5, 6]])
2×2 DataFrame
│ Row │ x │ y │
│ │ Array… │ Array… │
├─────┼────────┼────────┤
│ 1 │ [1, 2] │ [3, 4] │
│ 2 │ [3, 4] │ [5, 6] │
julia> @select(df, y = x .* y)
2×1 DataFrame
│ Row │ y │
│ │ Array… │
├─────┼──────────┤
│ 1 │ [3, 8] │
│ 2 │ [15, 24] │
This is actually the best example I’ve seen yet for why I don’t like operating column-wise by default on purely syntactic grounds – you suddenly lose access to the normal meaning of broadcasting syntax.
There is a lengthy discussion in DataFrames.jl right now about how we should skip missing values to make things easiest for the user here. tl;dr: everyone agrees on this for filtering, but deciding consistent behavior for @select is really hard when working with full columns.
This whole space was a topic of prolonged discussion between me and Simon Kornblith back in the era of DataArrays.jl. At the time, I was very averse to silently including only the rows of the underlying arrays that evaluate to `true` for a predicate, without raising an error, but I think this is the right decision for consistency with SQL. I don't always like SQL's choices, but I think uniformity matters more these days.
I see you have plans to lift missing values here. It seems like you can use passmissing from Missings.jl for this? No need to re-invent a lift function.
This is a good note. I'll avoid adding a `lift` function, but I think I still need the `use_default_lifting` system so that users can indicate they don't want `passmissing` applied at all to a function with custom `missing` handling, like `coalesce`.
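For readers who haven't seen it, `passmissing` from Missings.jl wraps a function so that it returns `missing` whenever any argument is `missing`:

using Missings

lifted = passmissing(uppercase)
lifted("hello")   # "HELLO"
lifted(missing)   # missing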
Because DataFramesMeta forces users to write `:x` when referring to the column `df.x`, DataFramesMeta doesn't have to worry about local variables yet. But if we decide to use `x` instead of `:x`, this way of escaping seems like the obvious solution.
I used to have less strong feelings about this, but I think BenchmarkTools won me over to the idea that all macros in Julia should embrace the pun on Julia's normal expression interpolation syntax and use it for local variable capture in macros. Once you accept that locals should be explicitly captured using `$local`, I think using plain identifiers rather than symbols comes naturally. And you suddenly get back the normal syntax for symbols to use in expressions.