Query.jl v0.11 released

davidanthoff · February 3, 2019, 1:48am

We just released Query.jl v0.11. The new version adds four new standalone query operators: @select, @rename, @mutate and @unique. We also fixed a number of bugs and wrote a bunch of new documentation. Finally, the standalone versions of the query operators are no longer considered experimental, so feel free to use the versions that use the pipe operator |> without any worries that they might go away in the future. The rest of this post will showcase the new query operators we added.

I also want to use this place to introduce two new Queryverse.jl team members: Yijia and Tony. Tony is a first year undergrad at UC Berkeley. He is pursuing a Computer Science and Cognitive Science degree and intends to minor in Data Science. He worked on a variety of projects in the past, including making contributions to Google’s Blockly and OpenCV. Yijia is a first year undergraduate student at UC Berkeley. He is currently studying Pure Math and Computer Science. He has worked on Rewrite.jl in the past and attended juliacon 2018. Both have been core members of the Queryverse.jl team since September 2018, and almost all new features described in this post were created by them. Please join me in welcoming them to the julia community!

The `@select` query operator

The standalone @select query operator makes it easier to select a subset of columns from a source table and arrange them in some specified order. Note that the discussion in this section is about the standalone @select query operator, not the LINQ style version of @select.

In the most basic example, you can simply specify all the columns that you want to include in the result table. You can select columns either by their name, or their position.

Lets first create a DataFrame with some example data that we will use as the basis for most examples in this post:

julia> using Query, DataFrames

julia> df = DataFrame(firstCol=rand(10), colA=rand(10), colB=rand(10), colC=rand(10), lastCol=rand(10))
10×5 DataFrame
│ Row │ firstCol    │ colA      │ colB       │ colC      │ lastCol   │
│     │ Float64     │ Float64   │ Float64    │ Float64   │ Float64   │
├─────┼─────────────┼───────────┼────────────┼───────────┼───────────┤
│ 1   │ 0.0815091   │ 0.726354  │ 0.164546   │ 0.419195  │ 0.0685362 │
│ 2   │ 0.984546    │ 0.132284  │ 0.00516908 │ 0.619361  │ 0.847543  │
│ 3   │ 0.452009    │ 0.421744  │ 0.443238   │ 0.581122  │ 0.518726  │
│ 4   │ 0.551547    │ 0.204774  │ 0.80987    │ 0.0977708 │ 0.866429  │
│ 5   │ 0.0537391   │ 0.362015  │ 0.760544   │ 0.775301  │ 0.810517  │
│ 6   │ 0.837029    │ 0.0614884 │ 0.596101   │ 0.0613467 │ 0.230335  │
│ 7   │ 0.701862    │ 0.702575  │ 0.714068   │ 0.283429  │ 0.308963  │
│ 8   │ 0.626259    │ 0.246562  │ 0.573619   │ 0.869932  │ 0.148938  │
│ 9   │ 0.000844883 │ 0.432564  │ 0.289367   │ 0.92641   │ 0.592507  │
│ 10  │ 0.365743    │ 0.666146  │ 0.932312   │ 0.352096  │ 0.684766  │

Now we select the column named colB, the second column and the column named lastCol into a new table:

julia> df |> @select(:colB, 2, :lastCol)
10x3 query result
colB       │ colA      │ lastCol
───────────┼───────────┼──────────
0.164546   │ 0.726354  │ 0.0685362
0.00516908 │ 0.132284  │ 0.847543
0.443238   │ 0.421744  │ 0.518726
0.80987    │ 0.204774  │ 0.866429
0.760544   │ 0.362015  │ 0.810517
0.596101   │ 0.0614884 │ 0.230335
0.714068   │ 0.702575  │ 0.308963
0.573619   │ 0.246562  │ 0.148938
0.289367   │ 0.432564  │ 0.592507
0.932312   │ 0.666146  │ 0.684766

Note that we are achieving two things at the same time here: we are picking only a few of the columns from the source table, and we are rearranging the order of the columns.

There are a whole number of options available that make it easier to select multiple columns without specifying each individually.

First, we can use the standard julia range syntax to select consecutive columns:

julia> df |> @select(:colB : :lastCol, :colA)
10x4 query result
colB       │ colC      │ lastCol   │ colA
───────────┼───────────┼───────────┼──────────
0.164546   │ 0.419195  │ 0.0685362 │ 0.726354
0.00516908 │ 0.619361  │ 0.847543  │ 0.132284
0.443238   │ 0.581122  │ 0.518726  │ 0.421744
0.80987    │ 0.0977708 │ 0.866429  │ 0.204774
0.760544   │ 0.775301  │ 0.810517  │ 0.362015
0.596101   │ 0.0613467 │ 0.230335  │ 0.0614884
0.714068   │ 0.283429  │ 0.308963  │ 0.702575
0.573619   │ 0.869932  │ 0.148938  │ 0.246562
0.289367   │ 0.92641   │ 0.592507  │ 0.432564
0.932312   │ 0.352096  │ 0.684766  │ 0.666146

In that example, we first selected all columns between colB and lastCol (so that would be colB, colC and lastCol), and then we also included colA as the second argument to the @select call.

We can also use the range syntax with column position indices:

julia> df |> @select(3:5, :colA)
10x4 query result
colB       │ colC      │ lastCol   │ colA
───────────┼───────────┼───────────┼──────────
0.164546   │ 0.419195  │ 0.0685362 │ 0.726354
0.00516908 │ 0.619361  │ 0.847543  │ 0.132284
0.443238   │ 0.581122  │ 0.518726  │ 0.421744
0.80987    │ 0.0977708 │ 0.866429  │ 0.204774
0.760544   │ 0.775301  │ 0.810517  │ 0.362015
0.596101   │ 0.0613467 │ 0.230335  │ 0.0614884
0.714068   │ 0.283429  │ 0.308963  │ 0.702575
0.573619   │ 0.869932  │ 0.148938  │ 0.246562
0.289367   │ 0.92641   │ 0.592507  │ 0.432564
0.932312   │ 0.352096  │ 0.684766  │ 0.666146

We also provide a number of convenience functions for bulk column selection. For example, to select all columns that have a name that starts with col, we can write:

julia> df |> @select(startswith("col"))
10x3 query result
colA      │ colB       │ colC
──────────┼────────────┼──────────
0.726354  │ 0.164546   │ 0.419195
0.132284  │ 0.00516908 │ 0.619361
0.421744  │ 0.443238   │ 0.581122
0.204774  │ 0.80987    │ 0.0977708
0.362015  │ 0.760544   │ 0.775301
0.0614884 │ 0.596101   │ 0.0613467
0.702575  │ 0.714068   │ 0.283429
0.246562  │ 0.573619   │ 0.869932
0.432564  │ 0.289367   │ 0.92641
0.666146  │ 0.932312   │ 0.352096

Or we can select all columns that have a name that ends with Col:

julia> df |> @select(endswith("Col"))
10x2 query result
firstCol    │ lastCol
────────────┼──────────
0.0815091   │ 0.0685362
0.984546    │ 0.847543
0.452009    │ 0.518726
0.551547    │ 0.866429
0.0537391   │ 0.810517
0.837029    │ 0.230335
0.701862    │ 0.308963
0.626259    │ 0.148938
0.000844883 │ 0.592507
0.365743    │ 0.684766

Lets say we have a situation where we want to select a large number of columns that we can easily select with one of the bulk column options discussed so far, except for one specific column. Here is how we can run such a query:

julia> df |> @select(startswith("col"), -:colB)
10x2 query result
colA      │ colC
──────────┼──────────
0.726354  │ 0.419195
0.132284  │ 0.619361
0.421744  │ 0.581122
0.204774  │ 0.0977708
0.362015  │ 0.775301
0.0614884 │ 0.0613467
0.702575  │ 0.283429
0.246562  │ 0.869932
0.432564  │ 0.92641
0.666146  │ 0.352096

Here we first select all columns that have a name that starts with col. At this point we have selected the columns colA, colB and colC. Next, we remove colB from this list, by specifying a minus - in front of its name as the next argument.

There is one twist to removing columns: if the first argument to the @select query operator starts with a minus, then we implicitly start out with all columns selected. Here is an example that demonstrates this:

julia> df |> @select(-:colB)
10x4 query result
firstCol    │ colA      │ colC      │ lastCol
────────────┼───────────┼───────────┼──────────
0.0815091   │ 0.726354  │ 0.419195  │ 0.0685362
0.984546    │ 0.132284  │ 0.619361  │ 0.847543
0.452009    │ 0.421744  │ 0.581122  │ 0.518726
0.551547    │ 0.204774  │ 0.0977708 │ 0.866429
0.0537391   │ 0.362015  │ 0.775301  │ 0.810517
0.837029    │ 0.0614884 │ 0.0613467 │ 0.230335
0.701862    │ 0.702575  │ 0.283429  │ 0.308963
0.626259    │ 0.246562  │ 0.869932  │ 0.148938
0.000844883 │ 0.432564  │ 0.92641   │ 0.592507
0.365743    │ 0.666146  │ 0.352096  │ 0.684766

Because we start this @select query operator with an argument that has a minus - in front, we start out with all columns, and then remove the colB column.

The minus - can of course also be combined with all the bulk column options. The next example selects all columns from the source table, except if they have a column name that starts with col:

julia> df |> @select(-startswith("col"))
10x2 query result
firstCol    │ lastCol
────────────┼──────────
0.0815091   │ 0.0685362
0.984546    │ 0.847543
0.452009    │ 0.518726
0.551547    │ 0.866429
0.0537391   │ 0.810517
0.837029    │ 0.230335
0.701862    │ 0.308963
0.626259    │ 0.148938
0.000844883 │ 0.592507
0.365743    │ 0.684766

If you are familiar with dplyr all of this will be highly familiar: we largely copied the semantics from that package.

The `@rename` query operator

The @rename query operator renames one or more columns from a table. Each rename is specified as a Pair, where the first element is either the old name of the column that should be renamed, or the position of the column. The second element must be a Symbol with the new name for the column. Here is an example:

julia> df |> @rename(:colB=>:foo, 1=>:bar)
10x5 query result
bar         │ colA      │ foo        │ colC      │ lastCol
────────────┼───────────┼────────────┼───────────┼──────────
0.0815091   │ 0.726354  │ 0.164546   │ 0.419195  │ 0.0685362
0.984546    │ 0.132284  │ 0.00516908 │ 0.619361  │ 0.847543
0.452009    │ 0.421744  │ 0.443238   │ 0.581122  │ 0.518726
0.551547    │ 0.204774  │ 0.80987    │ 0.0977708 │ 0.866429
0.0537391   │ 0.362015  │ 0.760544   │ 0.775301  │ 0.810517
0.837029    │ 0.0614884 │ 0.596101   │ 0.0613467 │ 0.230335
0.701862    │ 0.702575  │ 0.714068   │ 0.283429  │ 0.308963
0.626259    │ 0.246562  │ 0.573619   │ 0.869932  │ 0.148938
0.000844883 │ 0.432564  │ 0.289367   │ 0.92641   │ 0.592507
0.365743    │ 0.666146  │ 0.932312   │ 0.352096  │ 0.684766

Here we are renaming the colB column to foo, and the first column to bar. This example highlights that one can rename multiple columns in a single call to @rename.

All other columns in a source table are left unmodified by the @rename query operator, and the order of columns is also not changed.

The semantics of this query operator once again follow the equivalent dplyr function.

The `@mutate` query operator

The @mutate query operator can be used to either add columns to a table, or mutate the content of some columns in a table. Lets say we want to add a new column to our table that is the sum of column colA and the log of column colB:

julia> df |> @mutate(newCol = _.colA + log(_.colB))
10x6 query result
firstCol    │ colA      │ colB       │ colC      │ lastCol   │ newCol
────────────┼───────────┼────────────┼───────────┼───────────┼────────────
0.0815091   │ 0.726354  │ 0.164546   │ 0.419195  │ 0.0685362 │ -1.07821
0.984546    │ 0.132284  │ 0.00516908 │ 0.619361  │ 0.847543  │ -5.13278
0.452009    │ 0.421744  │ 0.443238   │ 0.581122  │ 0.518726  │ -0.391904
0.551547    │ 0.204774  │ 0.80987    │ 0.0977708 │ 0.866429  │ -0.00610799
0.0537391   │ 0.362015  │ 0.760544   │ 0.775301  │ 0.810517  │ 0.0882928
0.837029    │ 0.0614884 │ 0.596101   │ 0.0613467 │ 0.230335  │ -0.455857
0.701862    │ 0.702575  │ 0.714068   │ 0.283429  │ 0.308963  │ 0.365798
0.626259    │ 0.246562  │ 0.573619   │ 0.869932  │ 0.148938  │ -0.309229
0.000844883 │ 0.432564  │ 0.289367   │ 0.92641   │ 0.592507  │ -0.807497
0.365743    │ 0.666146  │ 0.932312   │ 0.352096  │ 0.684766  │ 0.596058

We pass each column we want compute for the output table as a keyword argument to the @mutate query operator. The name of the keyword argument will become the name of the column that we are computing. To the right of the equal sign = we pass an expression that computes the value for the current row for the new column. We have access to the current row via the _ syntax (in the same way it is used in all the other standalone query operators). In our example this allows us for example to reference the value of colA and colB in the current row _. If the output name for the column (in our case newCol) does not exist as a column in the input table, a new column with that name will be added at the end of the table.

The next example uses an output name for the column that already exists, in which case the content of that column will be replaced with the new, computed values:

julia> df |> @mutate(firstCol = _.colA + log(_.colB))
10x5 query result
firstCol    │ colA      │ colB       │ colC      │ lastCol
────────────┼───────────┼────────────┼───────────┼──────────
-1.07821    │ 0.726354  │ 0.164546   │ 0.419195  │ 0.0685362
-5.13278    │ 0.132284  │ 0.00516908 │ 0.619361  │ 0.847543
-0.391904   │ 0.421744  │ 0.443238   │ 0.581122  │ 0.518726
-0.00610799 │ 0.204774  │ 0.80987    │ 0.0977708 │ 0.866429
0.0882928   │ 0.362015  │ 0.760544   │ 0.775301  │ 0.810517
-0.455857   │ 0.0614884 │ 0.596101   │ 0.0613467 │ 0.230335
0.365798    │ 0.702575  │ 0.714068   │ 0.283429  │ 0.308963
-0.309229   │ 0.246562  │ 0.573619   │ 0.869932  │ 0.148938
-0.807497   │ 0.432564  │ 0.289367   │ 0.92641   │ 0.592507
0.596058    │ 0.666146  │ 0.932312   │ 0.352096  │ 0.684766

One can also specify multiple output columns in a single call to @mutate, for example the following code replaces the values in one existing column (named firstCol) and adds a new column named newCol.

julia> df |> @mutate(firstCol = _.colA + log(_.colB), newCol=_.colC * 2)
10x6 query result
firstCol    │ colA      │ colB       │ colC      │ lastCol   │ newCol
────────────┼───────────┼────────────┼───────────┼───────────┼─────────
-1.07821    │ 0.726354  │ 0.164546   │ 0.419195  │ 0.0685362 │ 0.838391
-5.13278    │ 0.132284  │ 0.00516908 │ 0.619361  │ 0.847543  │ 1.23872
-0.391904   │ 0.421744  │ 0.443238   │ 0.581122  │ 0.518726  │ 1.16224
-0.00610799 │ 0.204774  │ 0.80987    │ 0.0977708 │ 0.866429  │ 0.195542
0.0882928   │ 0.362015  │ 0.760544   │ 0.775301  │ 0.810517  │ 1.5506
-0.455857   │ 0.0614884 │ 0.596101   │ 0.0613467 │ 0.230335  │ 0.122693
0.365798    │ 0.702575  │ 0.714068   │ 0.283429  │ 0.308963  │ 0.566858
-0.309229   │ 0.246562  │ 0.573619   │ 0.869932  │ 0.148938  │ 1.73986
-0.807497   │ 0.432564  │ 0.289367   │ 0.92641   │ 0.592507  │ 1.85282
0.596058    │ 0.666146  │ 0.932312   │ 0.352096  │ 0.684766  │ 0.704192

Note that in all of these examples, @mutate leaves any column that is not named explicitly as an output column alone.

And it might not come as a surprise at this point: we again followed the lead of the equivalent dplyr function in the design.

The `@unique` query operator

The @unique query operator is simple: it removes any duplicate values from an input sequence. Here is a simple example where we remove duplicate values from an array of Int values:

julia> [1,2,3,4,5,2,4] |> @unique()
?-element query result
 1
 2
 3
 4
 5

One can of course also use this with tabular data:

julia> df2 = DataFrame(a=[1,2,3,2], b=[4,3,2,3])
4×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 3     │
│ 3   │ 3     │ 2     │
│ 4   │ 2     │ 3     │

julia> df2 |> @unique()
3x2 query result
a │ b
──┼──
1 │ 4
2 │ 3
3 │ 2

Conclusion

As always, please let us know about any issues or problems that you might run into while using Query.jl over in the github repository.

thautwarm · February 3, 2019, 5:30am

Hey David,
It’s cool to bring this pretty complete implementation of C# query syntaxes into Julia, but after reading the docs I found there is not a chance to make custom extension methods like:

@query_extension function custom_operation(df, arg1, arg2)
    ...
end

df |> 
@custom_operation(arg1, arg2) |>
...

I have such an implementation about this at https://github.com/thautwarm/MLStyle-Playground/blob/master/Linq.jl ,

import Linq: dispatch

dispatch(arr :: Vector{T}, ::Val{:str}) where T = string(arr)

dispatch(s :: String, ::Val{:println}) where T = println(s)

@linq [1, 2, 3].map(x -> 2x).str.println
# => [2, 4, 6]

Could you please take this into consideration? Also I strongly recommend you to use MLStyle to manipulate ASTs like rewriting query expressions.

davidanthoff · February 3, 2019, 6:27am

after reading the docs I found there is not a chance to make custom extension methods like

I’m not entirely sure I understood the comment properly, but the general pattern is that anything that can consume an iterator and is itself an iterator can just play along in the pipes that one would create with the standalone query operators. So if you wanted to create a custom query operator, that should be pretty straightforward. There is a small twist in that ideally you would use getiterator from GitHub - queryverse/IteratorInterfaceExtensions.jl: Traits for julia iterators, but that is really just a minor tweak to this story.

Your example code with dispatcher seems to really cover the extension method story from C#, and I guess it is less clear to me what role that would play in julia, given that we have multiple dispatch…

But, I might also just have misunderstood your comment entirely

Also I strongly recommend you to use MLStyle to manipulate ASTs like rewriting query expressions.

That looks great! I completely agree that the LINQ style query translation code right now is really quite horrible… At some point @bramtayl started rewriting that using MacroTools.jl, but so far that hasn’t made it over the finishing line. If someone wants to tackle that, I would love to see a PR! I don’t think we (the Berkeley team) will invest much more time in a rewrite of the part of the code, though. Yes, it is not very elegant, but on the other hand it works and is pretty complete, and so I think our time is better spent on other parts of the project.

I think on both points, maybe it would make sense to continue disucssion in issues in the github repo? They both are quite advanced internal design questions that are probably of less interest to the users of the package, so they might be best hosted on github.

thautwarm · February 3, 2019, 6:49am

Fair point.
I’ll be glad to discuss them with you in the issues. I’m about to make some PRs, your package makes me kind of excited.

leethargo · February 3, 2019, 1:50pm

Great and thanks! I guess with the new @select and @mutate, many queries will be more concise, because one no longer needs to repeat all the columns than one wants to keep as was done in @map.

thautwarm · February 3, 2019, 5:43pm

There’re so many compatibility issues have to be taken into considerations. Could we just give up on versions that’re lower than v1.0?

davidanthoff · February 3, 2019, 7:40pm

Yes, exactly, that is really the theme of this release! Things like modifying just one column in a table were really cumbersome with just @map in the past, but should be super easy now.

We are not entirely done with this yet, i.e. we still have a couple more ideas around the general theme of how to make it easier to deal with columns. One thing I hope we can tackle for the next release is a splatting syntax for named tuples, so that things like {i..., a=foo(i,j), j...} will work. Something like that would then be translated into merge(i, (a=foo(i, j),), j). That should be useful for things like joins.

davidanthoff · February 4, 2019, 5:21pm

I assume this refers to dropping support for julia versions lower tan v1.0? Right now the only additional version supported is julia v0.7, and at least in theory that should be identical to julia 1.0?

Topic		Replies	Views
Query.jl v0.7x released Community announcement	5	1645	September 12, 2017
Query.jl v0.8x released Community announcement	0	986	November 21, 2017
DataFrames: obtaining the subset of rows by a set of values New to Julia dataframes	45	24038	April 27, 2024
SQLite: query with duplicate column names \|> DataFrame Data dataframes , sqlite	15	5293	March 29, 2022
Query.jl - Return all columns with @map Data question	3	1225	August 5, 2019

Query.jl v0.11 released

The @select query operator

The @rename query operator

The @mutate query operator

The @unique query operator

Conclusion

Related topics

The `@select` query operator

The `@rename` query operator

The `@mutate` query operator

The `@unique` query operator