Hi! I’ve sent this question over a DM to professor @davidanthoff, but it’s better suited here.
Regarding the Query.jl package, I’d like to know what are the implementations of @join
and @groupjoin
, how efficient they are, and how they work under the lazy streaming format.
I’m really curious about this since joining tables efficiently is a really old problem that has many different approaches, and the package works on so many different table types.
Thanks a lot for your time!
1 Like
You can see the implementation for @join
here and for @groupjoin
here.
Both use a hash join algorithm. In the LINQ terminology one can classify a query operator’s execution model as immediate vs deferred, and streaming vs non-streaming (here is the exact discussion). Both @join
and @groupjoin
have a deferred, non-streaming execution model in our implementation right now. What that means is that the first call to iterate
will trigger full iteration of both sources for the join, and a complete materialization of the output of the join.
That could actually be improved, though: those implementations could be changed so that at least one of the join sources is not fully iterated in the first call to iterate
, and instead properly streamed.
Our medium term project (that we are kind of starting now) is to create another backend for Query.jl that is specific to tabular sources, and then provide a full query optimizer that can pick more efficient algorithms, depending on the query one is trying to execute. Don’t expect anything in the next couple of months, but we are trying out different designs etc. right now.
3 Likes