Question regarding Query.jl @join implementation

question

#1

Hi! I’ve sent this question over a DM to professor @davidanthoff, but it’s better suited here.

Regarding the Query.jl package, I’d like to know what are the implementations of @join and @groupjoin, how efficient they are, and how they work under the lazy streaming format.

I’m really curious about this since joining tables efficiently is a really old problem that has many different approaches, and the package works on so many different table types.

Thanks a lot for your time!


#2

You can see the implementation for @join here and for @groupjoin here.

Both use a hash join algorithm. In the LINQ terminology one can classify a query operator’s execution model as immediate vs deferred, and streaming vs non-streaming (here is the exact discussion). Both @join and @groupjoin have a deferred, non-streaming execution model in our implementation right now. What that means is that the first call to iterate will trigger full iteration of both sources for the join, and a complete materialization of the output of the join.

That could actually be improved, though: those implementations could be changed so that at least one of the join sources is not fully iterated in the first call to iterate, and instead properly streamed.

Our medium term project (that we are kind of starting now) is to create another backend for Query.jl that is specific to tabular sources, and then provide a full query optimizer that can pick more efficient algorithms, depending on the query one is trying to execute. Don’t expect anything in the next couple of months, but we are trying out different designs etc. right now.