[ANN] DataFrameIntervals.jl — joins on intervals of time

When originally announced, FlexiJoins indeed didn’t support interval overlap condition. This is one of the predicates I introduced more recently, see the update post: [ANN] FlexiJoins.jl: fresh take on joining datasets - #4 by aplavin.

As for complexity of this query, I think it should be something like M + N*sqrt(M) + K. For now, the interval overlap joins are performed this way:

  • Build a 2d KDTree for (start, end) points in one dataset.
  • Loop through the other dataset, and for each item find overlapping intervals using a 2d range query for the tree.

Clearly, this isn’t the most efficient approach, but NearestNeighbors.jl was a dependency anyway (for distance joins) – so I used it here as well.

The “overlap-join” is the only interval-related join implemented with a tree. Other conditions ( ∈,∋,⊆,⊊,⊋,⊇) use sorting and have different complexities.

How is it possible to have the complexity less than K, M, N – whichever is larger? (: