LightQuery 0.7.0

LighQuery 0.7 is out. It uses Base named tuples and as such is much much easier to use. I’m hoping that to have reached some level of interface stability, and that future changes will focus on performance. You should already find it extremely performant for tables with a small number of columns and no missing data.

See the docs here, which I’ve tried to make a little bit nicer. Usage and performance notes · LightQuery.jl

6 Likes

Thank you for creating this package. Do you plan to implement non-equi-joins?

I’ve thought about it. You can already sort of do it by combining Iterators.product and Iterators.filter. Is there a way to do a non-equi-join without visiting all possible pairs?

According to Join (SQL) - Inner join, SQL implementations use a hash join or sort-merge algorithm in order to avoid computing the full Cartesian product of both input tables.

Hmm, well, LightQuery essentially uses a sort-merge, but I’m not sure how to implement it for non-equi-joins (or even if it’s possible).

Consider finding all pairs in a list that sum to 10. 1 and 9 and 9 and 1 would both work, but they are on opposite sides of the list.

How about creating an index for the join-predicate (sum) that points to the corresponding rows in each table that make the predicate true? Create, insert, and delete functions would have to be aware of the join-predicate and maintain the index.

Hmm, still a bit confused but Id be very happy to take a PR

I think I misunderstood your example.

Did you mean, given a sorted list of integers from 1 to 10, find all pairs that sum to 10? In that case, for each item n in the list, you could perform a binary search on the same list for item 10 - n which would require a total of n log n visits, or complexity O(n log n).

I agree that deriving a single general algorithm for all cases would be difficult if not impossible. You’d have to use a different algorithm for different scenarios.

I meant roughly

collect(Iterators.filter(((x, y),) -> x + y == 10, Iterators.product(1:10, 1:10)))

Yes, I think that’s what I described. You’re joining together the same list of consecutive integers from 1 to 10 and selecting those pairs whose sum is 10. While you can’t avoid visiting all of the list items at least once, for each of those items, if the list is sorted, you need revisit only log n items for a total of n log n iterations which is better than the n^2 iterations that your expression performs. If the list isn’t sorted, the join couldn’t benefit from a binary search and would require n^2 iterations. If your program had to perform this join multiple times, the join algorithm would benefit from pre-sorting the list and using the sorted list instead.

So, performance of equi-joins and non-equi-joins can both benefit from sorted data. Granted, this is easier to imagine than implement.

It seems to me that the optimizations you could make would depend heavily on the join function being used?

Yes, almost certainly, so it would likely require multiple algorithms to optimise such joins. Clearly, the highest cost-benefit would come from optimising equi-joins since these are most common, but the performance of many non-equi-join conditions might also benefit from equi-join optimisations.

In DataFrames.jl issue #2340, @bkamins and @andyferris are exploring ways using SplitApplyCombine.jl to improve performance of joins between DataFrames. Though it doesn’t address non-equi-joins in particular, @bkamins commented that he does intend to address them in the future.

1 Like