[ANN] DataFrameIntervals.jl — joins on intervals of time

That’s definitely a different kind of grouping!
Currently, FlexiJoins doesn’t do anything that can be performed separately after the join. The single exception is inner/left/right join - fundamentally, they can be obtained from the outerjoin, but it’s very inconvenient to do each time.

I haven’t thought of the usecase you describe before. Not sure if I fully understand the motivation, can you describe a kinda-realistic example?

Is it grouping of one dataset before the join, while leaving the other as-is? Basically,

A |>
  group(_.field_group) |>
  map() do gr
    innerjoin((A=gr, B), by...) |>
      map(...) |>
      reduce(...)
  end

instead of

innerjoin((A, B), by...) |>
  group(_.A.field_group) |>
  map() do gr
    gr |>
      map(...) |>
      reduce(...)
  end

Or something more complex?

Also, for some common cases, joining by a composite condition is useful: like by_key(:name) & by_pred(:time, <, :time). If both datasets contain :name that should also be matched.

What I’ve implemented allows you to choose grouping variables from one or both of A and B, (because that would be possible if you did the operations in the opposite order); in your example this would translate to including a second group operation within the map before map-reducing.

Example: In my case I have ~8 hours of annotated signal data for each of ~40k recordings. There are 1000-10,000 annotations per recording. It’s much more reasonable to operate on a per-recording basis (so groupby(df, :recording)). I want to compute, e.g. the number of occurrences of some annotated event of interest in the each quarter of each recording. So I do something that looks like this, in concept:

quarters = quantile_windows(4, recordings, :quarter)
groups = groupby_interval_join(recordings, quarters, [:recording, :quarter], 
                               on=:time_span)
combine(groups, :annotation_label => (x -> sum(==(:event_of_interest), x)  => :event_count)

Doing this join over the entire 40k set of recordings is a lot more memory intensive than doing the join per-recording.

Thanks for the example!

I also encounter similar cases in my work: a large dataset that contains sets of measurements for completely separated individual objects. However, I haven’t really felt a need for such a “join-then-group” function. I tend to just group such datasets immediately after loading them, and work with a collection of groups afterwards – instead of a single flat dataset. Often, this approach isn’t even motivated by performance (for smaller datasets), but by convenience: it is conceptually simple to map a function over individual objects.

Potentially, such “join-then-group” functionality can be in-scope for FlexiJoins. But for now, I don’t have any plans in this direction.
Also, care should be taken to avoid confusion. The groupjoin term is typically used to mean grouping by one of the join sides, as FlexiJoins currently does. I think this naming originated from C# LINQ.

Further, note that the total FlexiJoins memory allocations in the Sort mode should be roughly 2 * sizeof(Int) * (length(input) + length(result)). When elements of the input dataset contain multiple fields, especially stuff like strings, the join memory usage can be less than the input size.

I think
can also do it:

https://sl-solution.github.io/InMemoryDatasets.jl/stable/man/joins/

Sorry, I don’t follow. Do what? Lazy joins or interval joins. Can you indicate where that’s documented?

I mean interval joins.

Inequality-kind joins

Thanks @Juan, I’ve added an entry to DataFrameIntervals.jl’s README.

Judging by those docs, it only supports L.a < R.x < L.b type of joins — while interval overlap corresponds to a more complex condition.

@haberdashPI The PR URL isn’t working?

Oh hey! I’m not sure why it isn’t working for you. That said, I’ve already merged, so if you have any issues with my description of TSx, please feel free to open a PR correcting any mistakes.

I actually am not clear now, having read through the documentation for inequality-like joins, whether it does or does not support interval overlap. I find the description ambiguous. @Juan can you clarify your thinking?

I’m not sure either.
In fact I don’t know how are your “overlapping” intervals. Can not you get the same results with L.a < R.x < L.b conditions?

I’ve just pointed out that package because I had seen it and it’s supposed to be very fast.

The creator of InMemoryDatasets is @sl-solution

Ah okay, thanks for the clarification. Overlapping means that the intersection of times between the two intervals is non-empty. So L.a < R.x < R.b is just one over several possible cases where this is true.

@haberdashPI I am unable to open the repository itself, is it private?

1 Like

Oh wow! Surprised no one else mentioned the link wasn’t working for them! I meant to make this public but it looks like something didn’t go through. Should be visible now! I would appreciate it if you could let me know if you have any further trouble accessing the repository.

Thanks @haberdashPI I can access the repository now.

Just to be clear on the plan for TSx, I would want to support any custom index for the TS structure, like BYOIT (Bring Your Own Index Type) as long as there are certain methods provided like sort, == etc. then all the other TSx methods should be available for free. Though, the actual operation like join() would be done by DataFrames.jl code so the type should be supported at that level. Also, it should support DataFrames.groupby().

In fact, TimeSpan and Interval could be good first use cases for implementing this functionality as they already exist as separate custom types.

@chiraganand: can you elaborate on this plan a bit? What is needed in the case of an interval is not just a sort. You need to break up the intervals into a sorted list of endpoints (left / right), and then scan for overlap: there is no == in this case: multiple intervals in dataset a can overlap with multiple intervals in dataset b (but none of them are equal); I do a join that matches all of these overlapping intervals.

InMemoryDatasets supports the inequality joins for one column from the left data set, since, this is the case that I could develop an efficient and flexible join algorithm. Adding inequality joins for more than one column needs algorithms with O(n^2) complexity which is very expensive (even for moderate size data sets).

Nevertheless, I welcome PRs that would introduce O(n^2) joins to InMemoryDatasets. By “O(n^2) joins” I mean those kind of joins which can find matching rows by using any given function (i.e. in addition to isequal), this can be called cartesianjoin or crossjoin.

@haberdashPI Okay, now I understand. I don’t think TSx will be able to handle such custom join cases. And, because it relies on the join functionality provided by DataFrames.jl so any data type which can be handled by a DataFrame joins TSx should be able to handle.

Hi! @sl-solution I completed the development of cartesianjoin() based on IMD, which supports inequality joins with multiple columns or joins with arbitrary user-defined conditions. Besides, all parameters for IMD’s join are supported. Check the link GitHub - dyeeee/CartesianJoin.jl: A join extension package developed based on InMemoryDatasets.jl, and I provide use cases and performance benchmarks. Suggestions are welcome. Thanks.

1 Like