[ANN] DataFrameIntervals.jl — joins on intervals of time

Will this package support ZonedDateTime? I am currently using a combination of FlexiJoins and IntervalSets to achieve this

We would like to add such (or similar) functionality to DataFrames.jl in the 1.5 release

That is great to hear!

It would be great to hear from you how you see it so that we can coordinate efforts and appropriately recognize and leverage your work! Thank you!

Happy to contribute! :slight_smile: In releasing this tiny package I had in mind an incremental, experimental approach to adding this feature. There are plenty of additional features one might want for interval_join, or for other interval-related functions (I have a few in mind). Perhaps DataFrameIntervals can serve as a sort of testing ground for these kinds of features? I think the fully fledged versions of said functions, once they’ve been exercised a little more, would be awesome to add as components of DataFrames.jl itself.

We will want to have only an API that does not introduce any significant new dependencies (so we would probably handle standard objects like NamedTuples, but most likely not e.g. TimeSpans.jl).

The functionality in DataFrameIntervals.jl is supported by the function find_intersections, which I added (with support from lots of other awesome folks) to Intervals.jl in a recent PR. Under the hood I use a wrapper array called IntervalArray that creates a view of other interval-like objects and treats them as Interval objects, without copying data.

So if you want DataFrames.jl to use the same or a similar implementation, that would require taking on Intervals.jl as a dependency. The advantage is that there are some very thorough unit tests handling all the various edge cases that arise when intervals have arbitrary endpoints (e.g. closed/open boundaries on either side). It’s not clear to me how to set up a well-tested version of interval_join without testing almost all the same edge cases, so doing this without Intervals.jl could involve a lot of duplicated effort, assuming you want to support arbitrary intervals.
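For readers unfamiliar with find_intersections, its semantics can be sketched in pure Julia with a naive quadratic pass. This is a hypothetical illustration for closed intervals represented as (start, stop) tuples, not the actual Intervals.jl implementation, which is asymptotically faster and respects open/closed endpoint semantics:

```julia
# Naive sketch of find_intersections' semantics, for closed intervals
# represented as (start, stop) tuples. The real Intervals.jl version is
# optimized and handles open/closed endpoints; this just pins down the idea.
overlaps(a, b) = a[1] <= b[2] && b[1] <= a[2]

function find_intersections_naive(xs, ys)
    # For each interval in `xs`, collect the indices of intervals in `ys`
    # that intersect it.
    return [findall(y -> overlaps(x, y), ys) for x in xs]
end

xs = [(1, 3), (4, 6)]
ys = [(2, 5), (7, 9)]
find_intersections_naive(xs, ys)  # [[1], [1]] -- both intervals hit ys[1]
```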


ZonedDateTime objects should “just” work. I implement the join logic using a function I added to Intervals.jl, and that has some tests verifying that ZonedDateTime objects work as interval endpoints. I don’t have any explicit tests for them in DataFrameIntervals.jl itself. Give it a whirl and let me know if you run into any issues.

As commented earlier, in DataFrames.jl I think we want to have features that can be defined in terms of types/concepts defined in Base Julia, DataAPI.jl, and Tables.jl. If a piece of functionality would require taking on additional dependencies, it should live in a separate package.

Sounds like you would probably want a separate implementation from the one I’m using, then. And it sounds like DataFrameIntervals.jl will have a use in supporting a broader range of interval types after DataFrames.jl 1.5. I’m happy to adjust my own API / discuss what the API should be; I’d certainly prefer if things are uniform across DataFrames.jl and DataFrameIntervals.jl, and if they both use the same function.

One way I could imagine that working is that the interface for interval_join (or whatever it is ultimately called) and friends is implemented in DataFrames.jl, but the core logic for matching rows to one another is a method that dispatches on the type of the columns (e.g. DataFrames.find_interval_intersections). That way DataFrameIntervals.jl could just implement that method for additional interval-like types.
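That extension mechanism might look something like the sketch below. All names here are hypothetical (find_interval_intersections is NOT an actual DataFrames.jl function, and MySpan stands in for a package-defined interval type); the point is just that a generic function owned by DataFrames.jl, with a fallback for dependency-free types, could be extended by packages via multiple dispatch:

```julia
# Hypothetical sketch of the proposed dispatch point -- not real
# DataFrames.jl API.

# Fallback method for plain (start, stop) tuples, the kind of
# dependency-free type DataFrames.jl itself might support:
function find_interval_intersections(left::AbstractVector{<:Tuple},
                                     right::AbstractVector{<:Tuple})
    [findall(r -> l[1] <= r[2] && r[1] <= l[2], right) for l in left]
end

# A package defining its own interval type would add a method:
struct MySpan  # stand-in for e.g. a TimeSpans.jl span
    start::Int
    stop::Int
end

function find_interval_intersections(left::AbstractVector{MySpan},
                                     right::AbstractVector{MySpan})
    [findall(r -> l.start <= r.stop && r.start <= l.stop, right) for l in left]
end
```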

You might also want to look at groupby_interval_join in DataFrameIntervals.jl which is a sort of lazy join. (I’ve been wondering if it would be worth making a more general package LazyJoins). I think this lazy approach is useful when the set of interval intersections is large and will be immediately followed by a set of groupby & combine operations.

You may also find (my) [ANN] FlexiJoins.jl: fresh take on joining datasets package useful.

  • Lots of join conditions available, including interval overlap
  • Supports a wide range of collections/table types, even DataFrames
  • Uniform interface and composability, as in by_key(:name) & by_pred(:value, ∈, :valrange)
  • Lots of other neat features – see the linked thread and the examples notebook.
  • Optimized, as in not looping over N*M pairs. Still, very specialized algorithms can be somewhat faster.
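As a concrete illustration, a composite condition like by_key(:name) & by_pred(:value, ∈, :valrange) matches exactly the pairs the following naive pure-Julia loop would find. FlexiJoins computes the same set without scanning all N*M pairs; this sketch only pins down the matching semantics:

```julia
# Naive reference semantics for by_key(:name) & by_pred(:value, ∈, :valrange).
# FlexiJoins avoids this O(N*M) scan; shown only to define what matches.
left  = [(name = "a", value = 2), (name = "b", value = 5)]
right = [(name = "a", valrange = 1:3), (name = "b", valrange = 7:9)]

matches = [(i, j) for (i, l) in pairs(left), (j, r) in pairs(right)
           if l.name == r.name && l.value in r.valrange]
# matches == [(1, 1)] -- only left[1] falls in a same-name range
```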

TSx.jl is another package where we are trying to build something similar: making it easier to handle (heterogeneous) timeseries data with a DataFrame at the core. The package exposes join methods based on the TimeType and Int index types. Extending the data structure to include TimeSpan is in fact one of the goals of the package.

@haberdashPI See if TSx solves your purpose and if not then what can we do to make it work?

Just took a quick look. It looks cool! I’m not clear if it supports my use case. In my case the data are usually sparse, irregularly sized intervals of time. Is that something you are imagining it will be useful for (once you support TimeSpans)?

I did get a chance to see it! Last time I checked, it did not support interval overlap, or at least I didn’t see that it was possible (though I see now that you list it as something it can do).

I hear you say it is not generally bounded by M*N, but what does bound the complexity? In my case it is bounded by the smaller of N log N, M log M, and K, where K is the number of rows in the output.

When originally announced, FlexiJoins indeed didn’t support interval overlap condition. This is one of the predicates I introduced more recently, see the update post: [ANN] FlexiJoins.jl: fresh take on joining datasets - #4 by aplavin.

As for the complexity of this query, I think it should be something like M + N*sqrt(M) + K. For now, interval overlap joins are performed this way:

  • Build a 2d KDTree for (start, end) points in one dataset.
  • Loop through the other dataset, and for each item find overlapping intervals using a 2d range query for the tree.

Clearly, this isn’t the most efficient approach, but NearestNeighbors.jl was a dependency anyway (for distance joins) – so I used it here as well.
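The reduction from overlap testing to a 2d range query works because two closed intervals overlap iff each starts before the other ends; viewed as points (start, stop), the stored intervals overlapping a query (qs, qe) are exactly those in the box start <= qe, stop >= qs. A pure-Julia sketch of that equivalence (not the NearestNeighbors.jl code):

```julia
# Two closed intervals a, b overlap iff a.start <= b.stop && b.start <= a.stop.
overlaps(a, b) = a[1] <= b[2] && b[1] <= a[2]

intervals = [(1, 4), (3, 8), (10, 12)]  # stored as (start, stop) points
query = (5, 11)

# Direct definition: which stored intervals overlap the query?
by_overlap = filter(iv -> overlaps(iv, query), intervals)

# Same set, phrased as a 2d box query over the (start, stop) points:
# keep points with start <= query stop and stop >= query start.
by_box = filter(p -> p[1] <= query[2] && p[2] >= query[1], intervals)

by_overlap == by_box  # the tree answers the box query efficiently
```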

The “overlap-join” is the only interval-related join implemented with a tree. Other conditions (∈, ∋, ⊆, ⊊, ⊋, ⊇) use sorting and have different complexities.

How is it possible to have the complexity less than K, M, N – whichever is larger? (:

Oh whoops! I misspoke. I meant the larger of these three quantities!

That’s a nice general approach for all kinds of join! Awesome.

In my case, I break up each interval by its endpoints (so each endpoint is a point along the real number line) and sort by these endpoints. Then I just scan the real number line and keep track of which intervals are “current” (in a Set object) to determine overlap. So that seems like it will generally be more efficient for this specific use case.
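That sweep can be sketched in pure Julia as follows. This is a simplified version for closed intervals given as (start, stop) tuples with integer endpoints, not the actual Intervals.jl code (which also tracks open/closed endpoints):

```julia
# Sweep-line overlap detection for closed intervals given as (start, stop).
# Break each interval into sorted endpoint events and keep a set of
# "currently open" intervals while scanning the line.
function sweep_overlaps(xs, ys)
    events = Tuple{Int, Int, Symbol, Int}[]  # (coord, tiebreak, kind, index)
    for (i, (s, e)) in pairs(xs)
        push!(events, (s, 0, :xstart, i), (e, 1, :xstop, i))
    end
    for (j, (s, e)) in pairs(ys)
        push!(events, (s, 0, :ystart, j), (e, 1, :ystop, j))
    end
    sort!(events)  # tiebreak 0 puts starts before stops at the same coordinate,
                   # so touching closed intervals count as overlapping
    openx, openy = Set{Int}(), Set{Int}()
    result = Set{Tuple{Int, Int}}()
    for (_, _, kind, idx) in events
        if kind === :xstart
            push!(openx, idx)
            foreach(j -> push!(result, (idx, j)), openy)  # overlaps all open ys
        elseif kind === :ystart
            push!(openy, idx)
            foreach(i -> push!(result, (i, idx)), openx)
        elseif kind === :xstop
            delete!(openx, idx)
        else
            delete!(openy, idx)
        end
    end
    return sort!(collect(result))
end

sweep_overlaps([(1, 3), (4, 6)], [(2, 5)])  # [(1, 1), (2, 1)]
```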

Yes, what you describe is the classical solution for this kind of problem. And this is likely the most efficient algorithm; nothing more performant is possible.

In FlexiJoins, all joins currently work in two steps: preprocess one side of the join (build KDTree in this example), then loop over the other side looking for matches for each element. This seems to better generalize to different join conditions, especially composite ones. Such an approach is also useful when performing multiple joins where one side stays the same.

In principle, looping over the second side in sorted order can help, if this is explicitly utilized in match finding. I’ve thought of that before in the context of simpler ∈ queries, but haven’t bothered to implement it.

Is this of use?

One other thing, I noticed that you have a groupby keyword. Are join groups instantiated in a lazy fashion in this case? I find there are a lot of patterns in my work where lazy instantiation would be useful. (If it is lazy, it would be worth stating that more explicitly in your examples notebook).

@aplavin and @chiraganand: I’ve added some information to my README about each of your packages in this PR. Feel free to review the PR and provide any feedback about my description.

I’m not totally sure what you mean by “lazy” joins and join groups. Can you elaborate? I’m all for efficiency (:

Grouping was one of the first features added to FlexiJoins, was already present when I originally announced at [ANN] FlexiJoins.jl: fresh take on joining datasets. It groups either by the left or the right-hand side. For example, grouping by the left side turns the default flat list of matches [(1, 1), (1, 2), (1, 3), (3, 1)] into [(1, [1, 2, 3]), (2, []), (3, [1])].

All join results are views of the original datasets, whether flat or grouped. Is this what you refer to as “lazy”? However, indices of matches are always computed eagerly; I don’t think there is a way around that.

For now, grouped results work with many collections and tables, except for DataFrames. They have a very different interface compared to other collections, so FlexiJoins grouping doesn’t work with them as-is. I believe the potential DataFrames support is easy to implement, but not sure what the reasonable interface should be. I don’t really encounter DataFrames myself, and don’t know what kind of return type their users would expect from a grouped join.

Ah, okay. I see, thanks for the explanation.

What I mean by a lazy join is implemented here. The actual matching of rows in dataframe X and Y to perform the join is only executed when a given group is requested. I find that a common pattern is to:

  • join,
  • group by a given subset of columns,
  • for each group: compute some set of statistics

If the data are large, and the number of rows per group is large, it is much less memory intensive to do these operations in another order:

  • group by
  • for each group:
    • join
    • compute statistics over that joined group

In this order you never need to instantiate the entire joined dataframe, just the grouped data, one-by-one, and then a combined data frame that holds the statistics for each group. But it is often less ergonomic to write that code; the lazy join lets you first “do” the join and then make a call to combine that executes the operations in the more efficient order.
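The memory saving comes purely from this reordering. Here is a minimal pure-Julia sketch of the two orderings, with a hypothetical join_rows helper standing in for any pairwise join (this is not the DataFrameIntervals.jl implementation):

```julia
# Hypothetical sketch of why group-then-join saves memory. `join_rows`
# stands in for any pairwise join producing matched row pairs.
join_rows(xs, ys) = [(x, y) for x in xs, y in ys if x.key == y.key]

X = [(key = k, val = v) for k in 1:3 for v in 1:2]  # 6 rows
Y = [(key = k, w = 10k) for k in 1:3]               # 3 rows

# Eager order: materialize the full join, then group and summarize.
eager = join_rows(X, Y)  # whole joined table lives in memory at once

# Lazy order: join one group at a time; only per-group results are live.
stats = map(1:3) do k
    xs = filter(r -> r.key == k, X)
    ys = filter(r -> r.key == k, Y)
    g = join_rows(xs, ys)        # small per-group join
    (key = k, n = length(g))     # compute statistics, then discard g
end
```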

I think this is a generally useful tool, not just for interval_join and want to eventually put together some kind of LazyJoins package or somesuch, but haven’t gotten around to it yet.


That’s definitely a different kind of grouping!
Currently, FlexiJoins doesn’t do anything that can be performed separately after the join. The single exception is inner/left/right joins: fundamentally, they can be obtained from the outer join, but it would be very inconvenient to do that each time.

I haven’t thought of the usecase you describe before. Not sure if I fully understand the motivation, can you describe a kinda-realistic example?

Is it grouping of one dataset before the join, while leaving the other as-is? Basically,

A |>
  group(_.field_group) |>
  map() do gr
    innerjoin((A=gr, B), by...) |>
      map(...) |>
      reduce(...)
  end

instead of

innerjoin((A, B), by...) |>
  group(_.A.field_group) |>
  map() do gr
    gr |>
      map(...) |>
      reduce(...)
  end

Or something more complex?

Also, for some common cases, joining by a composite condition is useful: like by_key(:name) & by_pred(:time, <, :time). If both datasets contain :name that should also be matched.

What I’ve implemented allows you to choose grouping variables from one or both of A and B (because that would be possible if you did the operations in the opposite order); in your example this would translate to including a second group operation within the map before map-reducing.

Example: in my case I have ~8 hours of annotated signal data for each of ~40k recordings, with 1,000-10,000 annotations per recording. It’s much more reasonable to operate on a per-recording basis (so groupby(df, :recording)). I want to compute, e.g., the number of occurrences of some annotated event of interest in each quarter of each recording. So I do something that looks like this, in concept:

quarters = quantile_windows(4, recordings, :quarter)
groups = groupby_interval_join(recordings, quarters, [:recording, :quarter], 
                               on=:time_span)
combine(groups, :annotation_label => (x -> sum(==(:event_of_interest), x)) => :event_count)

Doing this join over the entire 40k set of recordings is a lot more memory intensive than doing the join per-recording.

Thanks for the example!

I also encounter similar cases in my work: a large dataset that contains sets of measurements for completely separated individual objects. However, I haven’t really felt a need for such a “join-then-group” function. I tend to just group such datasets immediately after loading them, and work with a collection of groups afterwards – instead of a single flat dataset. Often, this approach isn’t even motivated by performance (for smaller datasets), but by convenience: it is conceptually simple to map a function over individual objects.

Potentially, such “join-then-group” functionality can be in-scope for FlexiJoins. But for now, I don’t have any plans in this direction.
Also, care should be taken to avoid confusion. The groupjoin term is typically used to mean grouping by one of the join sides, as FlexiJoins currently does. I think this naming originated from C# LINQ.

Further, note that the total FlexiJoins memory allocations in the Sort mode should be roughly 2 * sizeof(Int) * (length(input) + length(result)). When elements of the input dataset contain multiple fields, especially stuff like strings, the join memory usage can be less than the input size.