Query.jl v0.7x released

I just released Query.jl version v0.7.1. The v0.7.x series includes a number of smaller improvements and some major new experimental features. This post describes and explains all new features included in this release.

Enable {} everywhere

{} is the syntax for named tuples in Query.jl. In previous versions this syntax only worked in @select statements, and it only worked at the top level (for example you couldn’t create a named tuple that has a field that is itself a named tuple). The new release now enables the {} syntax everywhere in queries.

This is especially handy for @group statements. Take the following query as an example:

using DataFrames, Query

df = DataFrame(sex=[:male, :female, :female, :female], age=[21., 30., 45., 34.], children=[2,2,1,2])

@from i in df begin
    @group i.age by {i.sex, i.children} into g
    @select {g.key.sex, g.key.children, age=mean(g)}
    @collect DataFrame
end

Here we have data about four individuals: their sex, their age and how many children they have. We then group this data by sex and children, and compute the average age for each group. The expression by {i.sex, i.children} creates a named tuple as the grouping key, which is handy later on, because we can now access the individual fields of the group key by their name.

Experimental standalone commands

I also added number of new standalone commands that enable a user experience that is inspired by both the method syntax of the LINQ query operators and dplyr. I am still tinkering with the precise details of this API, so things might change going forward (hence “experimental”).

Here is a code example that highlights some of the new features and how they play together with some of the other packages that I’ve created over the last year:

using Dataverse

df = load("http://www.david-anthoff.com/blog/data/exampledata.csv") |>
    @groupby(_.Children) |>
    @select({Children=_.key,Age=mean(_..Age)}) |>
    @orderby(_.Children) |>
    @tee(save("output.csv")) |>
    @where(_.Age>30) |>    
    @tee(save("output.feather")) |>
    DataFrame

First, this uses the unregistered Dataverse.jl package (I’m still looking for a better name, ideas welcome!). That package pulls together a set of data packages that play nicely together and adds a number of small experimental features on top of Query.jl.

The query starts out by loading a CSV file from a URL. The load function is from the FileIO.jl package. You can use that function to load files from web addresses or local disc, and it works for CSV, Feather, Excel, Stata, SPSS and SAS files.

The next thing to note is that this query uses the pipe operator |> to build up a data processing pipeline. Essentially you pass your data through a series of manipulation commands and chain those together via the |> operator.

The @groupby macro call is the first new standalone query command introduced in this version of Query.jl. The argument to the @groupby macro is an anonymous function that selects the key by which the source data should be grouped. The normal julia syntax for this would be @groupby(i->i.Children). In the example above I use another experimental syntax in the new Query.jl version: when you write an expression that contains the _ symbol, that expression is translated into _ -> your_original_expression. This syntax is just a short-cut to writing anonymous functions. In the context of querying table-like sources, it typically stands for the current row in query commands. The syntax _.Children here therefor extracts the value of the Children column for each row, and rows are grouped by that value.

The next command is a @select macro call. It takes an anonymous function that projects each element from the source stream. The @groupby macro creates a stream of groups, so each element that the @select macro sees will be an element of type Grouping. A Grouping element has one field key that holds the value of the group key for that group. In this example I am accessing that value via the _.key expression in the @select call. Any Grouping element is also at the same time an array of the elements that were grouped. In our example, the Grouping element is an array of the rows that make up a single group. In this example I want to compute the average age for each group. The problem here is that _ will be an array of rows, not an array of the age column for each group. But I need to pass just a vector of values to the mean function, not a vector of rows. The standard julia way to achieve this would be to use a generator expression like mean(i.Age for i in _). Because these kind of summary operations for groups are so common in data analysis, I also added another new experimental syntax to Query.jl in this version that simplifies the syntax for this common pattern. The syntax a..b is a shortcut for map(i->i.b,a) in this version, and that enables the concise syntax mean(_..Age) that is used in our example.

The next command @orderby is relatively simple: it takes an anonymous function that extracts the key by which things should be sorted from each source element. So in this example we are sorting the dataset by the Children column at this stage.

The @tee macro is another experimental feature. It mimics the tee shell command. It takes the thing that is piped into it and pipes it into both the argument of the @tee macro, and then passes the input unmodified to the next stage in the general pipeline. In the example I use this to save the intermediate dataset we have created so far into a CSV file, but then pass the data on to further data manipulation commands.

The save function is once again from the FileIO.jl package. It currently supports saving tabular data as CSV and Feather files.

The @where command filters the source dataset: only elements for which the anonymous function that is passed to the @where command returns true are passed on to the next stage. I am using the _ syntax once
again here to create the anonymous function.

The next @tee command just saves the now filtered dataset to a Feather file.

The final line of this query pipes the result into a DataFrame. Note that we never allocated or used any of the table types like DataFrame or DataTable in the query so far. We could have easily concluded the query with a save call and actually never materialized the data into a DataFrame if we just wanted to load a file, manipulate it and save it back to disc again. But if we intend to use the data for further in-memory manipulation it is of course convenient to store it in a DataFrame (or any of the other supported table types, see the list in IterableTables.jl).

You can find a bit more information about these new experimental features in the documentation here.

I am really interested in feedback on these new features! They are obviously not complete at this point, but I hope you can get a general feel for the direction, and any comments on that and anything else would be most welcome. I track both the issues over at Query.jl and the discussion on the julia forum.

One important final point: this new syntax will eventually augment the syntax we already have in Query.jl. The @from macro will not go away! In fact, one can easily combine the two styles, for example in the following way:

using Dataverse

df = load("http://www.david-anthoff.com/blog/data/exampledata.csv") |>
    @query(i, begin
        @group i by i.Children into g
        @select {Children=g.key, Age=mean(g..Age)}
    end) |>
    @where(_.Age>30) |>
    DataFrame

Various smaller changes

Query.jl now uses the iterable tables interface definition in TableTraits.jl. This change should be entirely transparent to users, it just amounts to some reorganization of packages in the background.

The release also includes a number of performance improvements and bug fixes.

Thanks

Thanks to Florian for help with the .. syntax, and to Keno and Steven for pointing me to the @tee name.

19 Likes

This looks fantastic!

Purely syntax-wise it would be nice if the syntax would be less cluttered with macro calls. I really like the DataFramesMeta’s @linq macro syntax in this regard.

As a side note, I prefer the (one) macro call even to original dplyr syntax in R because the macro call in Julia makes it clear that some unusal rules apply in this context – in R some base operators get redefined and it can introduce hard-to-track bugs elsewhere.

I actually tried hard to come up with a design that doesn’t wrap everything in one big macro :slight_smile: There are multiple reasons:

  1. I think it generally nests nicer with other code. The example I wrote down above is one that has one very large query, but I wanted these query commands to also be usable “one-off”, and if you want to only use one of them it gets kind of tedious if there is a large block around everything.
  2. It makes it easier for others to hook into this pipeline. Essentially you can use other functions in this pipeline that don’t even know about Query.jl. For example, the whole plotting story in VegaLite.jl and potentially StatPlots.jl at some point can continue these pipelines.
  3. I’m hopeful that all the magic that is happening in these macros will eventually show up in base, and at that point I could just drop the @ from these commands and they would become normal function calls. There are open issues for all the things in base that I would need for this step, and I think in general most of those things seem pretty uncontroversial. Many of those won’t happen for julia 1.0 as far as I can tell, so this is more a long-term goal. But these standalone macros in general are a lot less magic than say the @from macro.

As a side note, I prefer the (one) macro call even to original dplyr syntax in R because the macro call in Julia makes it clear that some unusal rules apply in this context – in R some base operators get redefined and it can introduce hard-to-track bugs elsewhere.

I’ve tried really hard to not change the behavior of anything inside Query.jl queries. I completely agree with you, that introduces a hard break in terms of concepts that is not good for users. I did introduce some new syntax (e.g. {}), but that is not used in base, so nothing is changed. There is one exception to that rule: the new a..b syntax changes behavior relative to what would happen outside of a query. I’m not happy with that and have an alternative plan for that scenario, but that new design will have to wait until a.b field overloading is implemented. Given how important it is to have a simple syntax for summarizing groups in a query, I felt having the a..b syntax for now was a compromise that was ok.

1 Like

To make sure I understand, it really seems that, at least in the @where case, the only magic here is the helper_replace_anon_func_syntax function that turns an expression with a _ symbol into an anonymous function.
I have two questions:

  • what is the change in base that will allow this magic to be avoided?
  • I’ll probably need something similar to this for similar reasons (to allow users to use _.a and similar stuff), but at the same time I don’t find it ideal to depend on some unexported function from Query.jl. Do you think is a good idea to create a DataMacros package that has all of these metaprogramming tools to simplify data manipulation commands? Your helper_replace_anon_func_syntax, potentially the g..b, potentially the helpers for the @df macro in StatPlots and so on…

Yes, you are spot on, that is the only magic there.

There has been a fair bit of discussion of this type of syntax in julia itself, e.g. here. I think it is very clear that nothing on that front will happen for julia 1.0, but my sense is that there is a good chance that something like that will come at some later point.

Another option would be to just copy the code? I guess we could create a shared package, but on the flipside, I hope that all of those will go away at some point, and maintaining another set of repos does add extra work, and then release cycles are linked etc.

You are right, given that it’s not supposed to be a permanent solution copying the code is probably the safest option. Also it seems like with Macrotools all of these macros take very few lines of code.

1 Like