LightQuery 0.7.0

bramtayl · August 18, 2020, 11:29pm

LighQuery 0.7 is out. It uses Base named tuples and as such is much much easier to use. I’m hoping that to have reached some level of interface stability, and that future changes will focus on performance. You should already find it extremely performant for tables with a small number of columns and no missing data.

See the docs here, which I’ve tried to make a little bit nicer. Usage and performance notes · LightQuery.jl

derekmahar · August 19, 2020, 1:25pm

Thank you for creating this package. Do you plan to implement non-equi-joins?

bramtayl · August 19, 2020, 1:28pm

I’ve thought about it. You can already sort of do it by combining Iterators.product and Iterators.filter. Is there a way to do a non-equi-join without visiting all possible pairs?

derekmahar · August 19, 2020, 1:52pm

According to Join (SQL) - Inner join, SQL implementations use a hash join or sort-merge algorithm in order to avoid computing the full Cartesian product of both input tables.

bramtayl · August 19, 2020, 2:01pm

Hmm, well, LightQuery essentially uses a sort-merge, but I’m not sure how to implement it for non-equi-joins (or even if it’s possible).

Consider finding all pairs in a list that sum to 10. 1 and 9 and 9 and 1 would both work, but they are on opposite sides of the list.

derekmahar · August 19, 2020, 2:23pm

How about creating an index for the join-predicate (sum) that points to the corresponding rows in each table that make the predicate true? Create, insert, and delete functions would have to be aware of the join-predicate and maintain the index.

bramtayl · August 19, 2020, 5:54pm

Hmm, still a bit confused but Id be very happy to take a PR

derekmahar · August 19, 2020, 8:52pm

I think I misunderstood your example.

Did you mean, given a sorted list of integers from 1 to 10, find all pairs that sum to 10? In that case, for each item n in the list, you could perform a binary search on the same list for item 10 - n which would require a total of n log n visits, or complexity O(n log n).

I agree that deriving a single general algorithm for all cases would be difficult if not impossible. You’d have to use a different algorithm for different scenarios.

bramtayl · August 19, 2020, 9:04pm

I meant roughly

collect(Iterators.filter(((x, y),) -> x + y == 10, Iterators.product(1:10, 1:10)))

derekmahar · August 19, 2020, 9:58pm

Yes, I think that’s what I described. You’re joining together the same list of consecutive integers from 1 to 10 and selecting those pairs whose sum is 10. While you can’t avoid visiting all of the list items at least once, for each of those items, if the list is sorted, you need revisit only log n items for a total of n log n iterations which is better than the n^2 iterations that your expression performs. If the list isn’t sorted, the join couldn’t benefit from a binary search and would require n^2 iterations. If your program had to perform this join multiple times, the join algorithm would benefit from pre-sorting the list and using the sorted list instead.

So, performance of equi-joins and non-equi-joins can both benefit from sorted data. Granted, this is easier to imagine than implement.

bramtayl · August 20, 2020, 12:55pm

It seems to me that the optimizations you could make would depend heavily on the join function being used?

derekmahar · August 20, 2020, 2:31pm

Yes, almost certainly, so it would likely require multiple algorithms to optimise such joins. Clearly, the highest cost-benefit would come from optimising equi-joins since these are most common, but the performance of many non-equi-join conditions might also benefit from equi-join optimisations.

In DataFrames.jl issue #2340, @bkamins and @andyferris are exploring ways using SplitApplyCombine.jl to improve performance of joins between DataFrames. Though it doesn’t address non-equi-joins in particular, @bkamins commented that he does intend to address them in the future.

Topic		Replies	Views
ANN: LightQuery tutorial Package Announcements tutorials	43	3139	July 11, 2019
Question regarding Query.jl @join implementation Data question	1	480	November 29, 2018
ANN: LightQuery General Usage	2	663	January 28, 2019
Left join algorithm for columnar tables General Usage tables , flexijoins	4	118	November 18, 2024
Serious group-by performance issue with Query.jl Data	26	2326	October 13, 2019

LightQuery 0.7.0

Related topics