ANN: LightQuery tutorial

I’ve put out a 1.0 version of LightQuery, and with it I’ve done a tutorial. This tutorial matches the one for dplyr as closely as possible. See https://bramtayl.github.io/LightQuery.jl/latest/#Tutorial-1 for the tutorial.

4 Likes

Suggests you rename your package to “MyPreciousQuery” given your tag line is “One query to rule them all” :grinning:

1 Like

What’s this package for? Looks like it’s a row-oriented querying library? Which also works on column-oriented data. But the default data is in row-oriented format.

I’d definitely consider MyPreciousQuery. Some of the functions work with iterators (e.g. rows) and some work with NamedTuples (e.g. columns). You have to explicitly move your data back and forth between row-wise and columnwise using the rows and columns functions provided, depending on what you want to do.

I thought the trend is towards columnar database because it’s generally faster due to the fact that in most cases you would want to apply functions to every element in the same column.

I was surprised that the default seems to be row-oriented? I can see how that will have bad performance for 100 million rows.

You can keep your data stored as columns if you want; I’d definitely recommend it. In fact, the flights data that I’m working with in the tutorial is stored as columns.

rows is a lazy iterator, and doesn’t affect how the underlying data is stored, if that clears things up.

1 Like

A lot of the under-the-hood magic that allows column-wise collection is driven by unzip, which I’m particularly proud of.

3 Likes

Genuinely curious about people’s reaction to this, esp. in comparison to other querying packages.

Is this package any different from Query.jl? Just curious.

Oh yeah, a compare and contrast with Query would be helpful for understanding why this exists and what ppl should look for when evaluating it

3 Likes

Hmm, well it’s similar, but different in a couple of ways:

  1. Efficient usage with native missing
  2. No reliance on inference (I’m not sure if it still does, but Query did used to rely on inference)
  3. Much simpler interface and code (two very simple macros, and only two new iterators. The rest of the iterators are all taken from Base.Iterators).
  4. Added flexibility from relying on an explicit chaining macro
  5. Not tested against a huge variety of data sources and sinks. I think? it should be theoretically possible to support them.
  6. Huge performance improvements for presorted data with grouping and joining
2 Likes

This is interesting!

Nice. I notice when following the tutorial that it’s a lot more verbose than dplyr. Is that by design, or do you plan to simplify the syntax?

I have some simplifications planned for groups and joins.

1 Like

Not much I can do about the @_ _.column syntax, I’m afraid. Julia doesn’t have lazy evaluation like R does.

I don’t want to derail the discussion but personally I’m a bit sad that the StructuredQueries project hasn’t continued. That approach makes a lot of sense because it can provide both a nice syntax as well as support for “external backends” such as databases or Spark. I think the combination of these two aspects is extremely powerful and the main reasons behind the success of dplyr and SQL.
The point here being that perhaps it’s worthwhile to focus more on the syntax and abstracting away the “backend” even at the cost of added complexity in implementation. And just in case: this is not meant as a criticism on any level, just a comment regarding

So just to give a flavor of the kind of syntax improvements I could make, consider:

julia> dest_tailnum =
          @> flights |>
          rows |>
          order(_, select(:dest, :tailnum)) |>
          By(_, select(:dest, :tailnum)) |>
          Group |>
          over(_, @_ transform(_.first,
                    flights = length(_.second)
          )) |>
          columns(_, :dest, :tailnum, :flights)

I could provide some convenience functions, for example:

julia> dest_tailnum =
          @> flights |>
          group_by(_, :dest, :tailnum) |>
          summarize(_, flights = @_ length(_.flights)) |>
          ungroup

Which is pretty darn close to the dplyr syntax. Of course, there’s a loss in terms of flexibility and clarify (IMHO), but sounds like something people would really like?

1 Like

Query.jls design is all setup to support that scenario, and at one point I had an example of a very simple translation to SQL. That was only a proof of concept, and would require a fair bit of work to make actually usable, but in terms of architecture everything is in place to support that kind of scenario.

So what does one use? query.jl or this package? Is there a way to merge them? I.e. take the best features of both ?:smiley:

1 Like