ANN: LightQuery tutorial

bramtayl · January 30, 2019, 5:30am

I’ve put out a 1.0 version of LightQuery, and with it I’ve done a tutorial. This tutorial matches the one for dplyr as closely as possible. See https://bramtayl.github.io/LightQuery.jl/latest/#Tutorial-1 for the tutorial.

xiaodai · January 30, 2019, 5:41am

I suggest you rename your package to “MyPreciousQuery”, given your tag line is “One query to rule them all”.

xiaodai · January 30, 2019, 5:46am

What’s this package for? Looks like it’s a row-oriented querying library? Which also works on column-oriented data. But the default data is in row-oriented format.

bramtayl · January 30, 2019, 5:51am

I’d definitely consider MyPreciousQuery. Some of the functions work with iterators (e.g. rows) and some work with NamedTuples (e.g. columns). You have to explicitly move your data back and forth between row-wise and columnwise using the rows and columns functions provided, depending on what you want to do.

xiaodai · January 30, 2019, 5:58am

I thought the trend was towards columnar databases, because they’re generally faster: in most cases you want to apply a function to every element in the same column.

I was surprised that the default seems to be row-oriented? I can see how that will have bad performance for 100 million rows.

bramtayl · January 30, 2019, 6:03am

You can keep your data stored as columns if you want; I’d definitely recommend it. In fact, the flights data that I’m working with in the tutorial is stored as columns.

bramtayl · January 30, 2019, 6:04am

rows is a lazy iterator, and doesn’t affect how the underlying data is stored, if that clears things up.
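
To make the row/column distinction concrete, here is a minimal sketch in plain Base Julia. It is not LightQuery's code: `lazy_rows` and `to_columns` are hypothetical helpers written for illustration, and the sample values are made up. It only shows that data can stay stored as columns while being iterated lazily as rows, then collected back into columns.

```julia
# Column-wise storage: a NamedTuple of equal-length vectors (sample values are made up).
flights = (dest = ["IAH", "MIA", "BQN"], distance = [1400, 1089, 1576])

# Hypothetical helper: lazily view the columns as NamedTuple rows.
# Nothing is copied; each row is built on demand while iterating.
lazy_rows(cols) = (NamedTuple{keys(cols)}(vals) for vals in zip(cols...))

# Hypothetical helper: collect the named fields of each row back into columns.
# Assumes `rows` can be iterated more than once (one pass per column).
function to_columns(rows, names...)
    NamedTuple{names}(map(name -> [getproperty(row, name) for row in rows], names))
end

# Row-wise step (a lazy filter from Base.Iterators), then back to columns.
long_haul = Iterators.filter(row -> row.distance > 1200, lazy_rows(flights))
result = to_columns(long_haul, :dest, :distance)
```

The underlying `flights` vectors are never modified; only `to_columns` allocates new storage.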

bramtayl · January 30, 2019, 6:12am

A lot of the under-the-hood magic that allows column-wise collection is driven by unzip, which I’m particularly proud of.
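
For readers unfamiliar with the idea, a naive unzip can be sketched in Base Julia as below. This is an assumption-laden illustration, not the package's implementation: `simple_unzip` is a hypothetical name, it requires a non-empty input that can be iterated twice, and it fixes each column's element type from the first item.

```julia
# Hypothetical sketch: turn an iterator of equal-length tuples into a tuple of vectors.
function simple_unzip(items)
    first_item = first(items)
    # One empty vector per tuple position, typed after the first item's fields.
    columns = map(x -> typeof(x)[], first_item)
    for item in items
        # Push each field of the item onto the matching column.
        foreach(push!, columns, item)
    end
    columns
end

letters, numbers = simple_unzip(zip("abc", 1:3))
```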

bramtayl · January 30, 2019, 7:46pm

Genuinely curious about people’s reaction to this, esp. in comparison to other querying packages.

Yifan_Liu · January 30, 2019, 9:32pm

Is this package any different from Query.jl? Just curious.

xiaodai · January 30, 2019, 9:45pm

Oh yeah, a compare and contrast with Query would be helpful for understanding why this exists and what ppl should look for when evaluating it

bramtayl · January 31, 2019, 3:19am

Hmm, well it’s similar, but different in a couple of ways:

Efficient usage with native missing
No reliance on inference (I’m not sure if it still does, but Query used to rely on inference)
Much simpler interface and code (two very simple macros, and only two new iterators. The rest of the iterators are all taken from Base.Iterators).
Added flexibility from relying on an explicit chaining macro
Not tested against a huge variety of data sources and sinks, though I think it should be theoretically possible to support them.
Huge performance improvements for presorted data with grouping and joining
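
On the last point, the reason presorted data helps can be sketched in Base Julia: if rows are already ordered by the key, groups are runs of consecutive rows, so a single pass suffices and no hash table is needed. The helper `group_sorted` below is hypothetical and the sample rows are made up; this is not how LightQuery implements grouping.

```julia
# Hypothetical helper: group consecutive rows that share a key, in one pass.
# Assumes `rows` is already sorted by `key`; unsorted input would yield split groups.
function group_sorted(key, rows)
    groups = Pair{Any, Vector{eltype(rows)}}[]
    for row in rows
        k = key(row)
        if !isempty(groups) && isequal(first(last(groups)), k)
            # Same key as the previous row: extend the current group.
            push!(last(last(groups)), row)
        else
            # New key: start a new group.
            push!(groups, k => eltype(rows)[row])
        end
    end
    groups
end

# Made-up sample rows.
flights = [
    (dest = "IAH", tailnum = "N14228"),
    (dest = "MIA", tailnum = "N619AA"),
    (dest = "IAH", tailnum = "N14228"),
    (dest = "IAH", tailnum = "N24211"),
]

by_key(row) = (row.dest, row.tailnum)
sorted = sort(flights, by = by_key)

# Count flights per (dest, tailnum) group.
counts = [(dest = k[1], tailnum = k[2], flights = length(g))
          for (k, g) in group_sorted(by_key, sorted)]
```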

xiaodai · January 31, 2019, 3:30am

This is interesting!

mkborregaard · January 31, 2019, 9:20am

Nice. I notice when following the tutorial that it’s a lot more verbose than dplyr. Is that by design, or do you plan to simplify the syntax?

bramtayl · January 31, 2019, 4:48pm

I have some simplifications planned for groups and joins.

bramtayl · January 31, 2019, 4:53pm

Not much I can do about the @_ _.column syntax, I’m afraid. Julia doesn’t have lazy evaluation like R does.
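
To illustrate why some marker is needed: without lazy evaluation, an expression such as `_.column` has to be wrapped into a function before it can be applied to each row. A toy macro doing that could look like the sketch below. The names `@underscore` and `substitute` are hypothetical, and this is a simplified stand-in, not LightQuery's `@_`; it handles only a single `_` argument.

```julia
# Hypothetical helper: replace every `_` symbol in an expression with `arg`.
substitute(ex, arg) = ex
substitute(ex::Symbol, arg) = ex === :_ ? arg : ex
substitute(ex::Expr, arg) = Expr(ex.head, map(a -> substitute(a, arg), ex.args)...)

# Hypothetical macro: turn an expression containing `_` into a one-argument function.
macro underscore(body)
    arg = gensym("row")
    esc(:($arg -> $(substitute(body, arg))))
end

# Equivalent to writing `row -> row.dest` by hand.
get_dest = @underscore _.dest
get_dest((dest = "IAH", tailnum = "N14228"))
```

In R, dplyr can capture unevaluated column expressions directly, which is why no such marker appears there.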

ValdarT · January 31, 2019, 6:08pm

I don’t want to derail the discussion but personally I’m a bit sad that the StructuredQueries project hasn’t continued. That approach makes a lot of sense because it can provide both a nice syntax as well as support for “external backends” such as databases or Spark. I think the combination of these two aspects is extremely powerful and the main reasons behind the success of dplyr and SQL.
The point here being that perhaps it’s worthwhile to focus more on the syntax and abstracting away the “backend” even at the cost of added complexity in implementation. And just in case: this is not meant as a criticism on any level, just a comment regarding

bramtayl · January 31, 2019, 6:27pm

So just to give a flavor of the kind of syntax improvements I could make, consider:

julia> dest_tailnum =
          @> flights |>
          rows |>
          order(_, select(:dest, :tailnum)) |>
          By(_, select(:dest, :tailnum)) |>
          Group |>
          over(_, @_ transform(_.first,
                    flights = length(_.second)
          )) |>
          columns(_, :dest, :tailnum, :flights)

I could provide some convenience functions, for example:

julia> dest_tailnum =
          @> flights |>
          group_by(_, :dest, :tailnum) |>
          summarize(_, flights = @_ length(_.flights)) |>
          ungroup

This is pretty darn close to the dplyr syntax. Of course, there’s a loss in terms of flexibility and clarity (IMHO), but it sounds like something people would really like?

davidanthoff · January 31, 2019, 7:30pm

Query.jl’s design is all set up to support that scenario, and at one point I had an example of a very simple translation to SQL. That was only a proof of concept, and would require a fair bit of work to make actually usable, but in terms of architecture everything is in place to support that kind of scenario.

affans · January 31, 2019, 7:47pm

So what does one use? Query.jl or this package? Is there a way to merge them, i.e. take the best features of both?
