[ANN] DataFrameIntervals.jl — joins on intervals of time

aplavin · July 16, 2022, 8:09pm

That’s definitely a different kind of grouping!
Currently, FlexiJoins doesn’t do anything that can be performed separately after the join. The single exception is inner/left/right join - fundamentally, they can be obtained from the outerjoin, but it’s very inconvenient to do each time.

I haven’t thought of the usecase you describe before. Not sure if I fully understand the motivation, can you describe a kinda-realistic example?

Is it grouping of one dataset before the join, while leaving the other as-is? Basically,

A |>
  group(_.field_group) |>
  map() do gr
    innerjoin((A=gr, B), by...) |>
      map(...) |>
      reduce(...)
  end

instead of

innerjoin((A, B), by...) |>
  group(_.A.field_group) |>
  map() do gr
    gr |>
      map(...) |>
      reduce(...)
  end

Or something more complex?

Also, for some common cases, joining by a composite condition is useful: like by_key(:name) & by_pred(:time, <, :time). If both datasets contain :name that should also be matched.

haberdashPI · July 21, 2022, 2:52am

What I’ve implemented allows you to choose grouping variables from one or both of A and B, (because that would be possible if you did the operations in the opposite order); in your example this would translate to including a second group operation within the map before map-reducing.

Example: In my case I have ~8 hours of annotated signal data for each of ~40k recordings. There are 1000-10,000 annotations per recording. It’s much more reasonable to operate on a per-recording basis (so groupby(df, :recording)). I want to compute, e.g. the number of occurrences of some annotated event of interest in the each quarter of each recording. So I do something that looks like this, in concept:

quarters = quantile_windows(4, recordings, :quarter)
groups = groupby_interval_join(recordings, quarters, [:recording, :quarter], 
                               on=:time_span)
combine(groups, :annotation_label => (x -> sum(==(:event_of_interest), x)  => :event_count)

Doing this join over the entire 40k set of recordings is a lot more memory intensive than doing the join per-recording.

aplavin · July 21, 2022, 10:35am

Thanks for the example!

I also encounter similar cases in my work: a large dataset that contains sets of measurements for completely separated individual objects. However, I haven’t really felt a need for such a “join-then-group” function. I tend to just group such datasets immediately after loading them, and work with a collection of groups afterwards – instead of a single flat dataset. Often, this approach isn’t even motivated by performance (for smaller datasets), but by convenience: it is conceptually simple to map a function over individual objects.

Potentially, such “join-then-group” functionality can be in-scope for FlexiJoins. But for now, I don’t have any plans in this direction.
Also, care should be taken to avoid confusion. The groupjoin term is typically used to mean grouping by one of the join sides, as FlexiJoins currently does. I think this naming originated from C# LINQ.

Further, note that the total FlexiJoins memory allocations in the Sort mode should be roughly 2 * sizeof(Int) * (length(input) + length(result)). When elements of the input dataset contain multiple fields, especially stuff like strings, the join memory usage can be less than the input size.

Juan · July 21, 2022, 3:15pm

I think
can also do it:

https://sl-solution.github.io/InMemoryDatasets.jl/stable/man/joins/

haberdashPI · July 22, 2022, 11:23am

Sorry, I don’t follow. Do what? Lazy joins or interval joins. Can you indicate where that’s documented?

Juan · July 22, 2022, 11:27am

I mean interval joins.

Inequality-kind joins

haberdashPI · July 22, 2022, 1:50pm

Thanks @Juan, I’ve added an entry to DataFrameIntervals.jl’s README.

aplavin · July 22, 2022, 2:04pm

Judging by those docs, it only supports L.a < R.x < L.b type of joins — while interval overlap corresponds to a more complex condition.

chiraganand · July 24, 2022, 7:07pm

@haberdashPI The PR URL isn’t working?

haberdashPI · July 24, 2022, 10:03pm

Oh hey! I’m not sure why it isn’t working for you. That said, I’ve already merged, so if you have any issues with my description of TSx, please feel free to open a PR correcting any mistakes.

haberdashPI · July 25, 2022, 10:49am

I actually am not clear now, having read through the documentation for inequality-like joins, whether it does or does not support interval overlap. I find the description ambiguous. @Juan can you clarify your thinking?

Juan · July 25, 2022, 11:12am

I’m not sure either.
In fact I don’t know how are your “overlapping” intervals. Can not you get the same results with L.a < R.x < L.b conditions?

I’ve just pointed out that package because I had seen it and it’s supposed to be very fast.

The creator of InMemoryDatasets is @sl-solution

haberdashPI · July 25, 2022, 1:16pm

Ah okay, thanks for the clarification. Overlapping means that the intersection of times between the two intervals is non-empty. So L.a < R.x < R.b is just one over several possible cases where this is true.

chiraganand · July 25, 2022, 5:16pm

@haberdashPI I am unable to open the repository itself, is it private?

haberdashPI · July 26, 2022, 12:03am

Oh wow! Surprised no one else mentioned the link wasn’t working for them! I meant to make this public but it looks like something didn’t go through. Should be visible now! I would appreciate it if you could let me know if you have any further trouble accessing the repository.

chiraganand · July 26, 2022, 7:58am

Thanks @haberdashPI I can access the repository now.

Just to be clear on the plan for TSx, I would want to support any custom index for the TS structure, like BYOIT (Bring Your Own Index Type) as long as there are certain methods provided like sort, == etc. then all the other TSx methods should be available for free. Though, the actual operation like join() would be done by DataFrames.jl code so the type should be supported at that level. Also, it should support DataFrames.groupby().

In fact, TimeSpan and Interval could be good first use cases for implementing this functionality as they already exist as separate custom types.

haberdashPI · July 26, 2022, 5:04pm

@chiraganand: can you elaborate on this plan a bit? What is needed in the case of an interval is not just a sort. You need to break up the intervals into a sorted list of endpoints (left / right), and then scan for overlap: there is no == in this case: multiple intervals in dataset a can overlap with multiple intervals in dataset b (but none of them are equal); I do a join that matches all of these overlapping intervals.

sl-solution · July 27, 2022, 8:06pm

InMemoryDatasets supports the inequality joins for one column from the left data set, since, this is the case that I could develop an efficient and flexible join algorithm. Adding inequality joins for more than one column needs algorithms with O(n^2) complexity which is very expensive (even for moderate size data sets).

Nevertheless, I welcome PRs that would introduce O(n^2) joins to InMemoryDatasets. By “O(n^2) joins” I mean those kind of joins which can find matching rows by using any given function (i.e. in addition to isequal), this can be called cartesianjoin or crossjoin.

chiraganand · July 31, 2022, 11:52am

@haberdashPI Okay, now I understand. I don’t think TSx will be able to handle such custom join cases. And, because it relies on the join functionality provided by DataFrames.jl so any data type which can be handled by a DataFrame joins TSx should be able to handle.

dyeeee · October 22, 2022, 10:19am

Hi! @sl-solution I completed the development of cartesianjoin() based on IMD, which supports inequality joins with multiple columns or joins with arbitrary user-defined conditions. Besides, all parameters for IMD’s join are supported. Check the link GitHub - dyeeee/CartesianJoin.jl: A join extension package developed based on InMemoryDatasets.jl, and I provide use cases and performance benchmarks. Suggestions are welcome. Thanks.

Topic		Replies	Views
Conditional left join 2 dataframes when none of the columns are common General Usage dataframes , flexijoins	26	1990	May 26, 2022
[ANN] FlexiJoins.jl: fresh take on joining datasets Package Announcements	32	3186	April 3, 2025
Asof join support in DataFrames.jl Data dataframes	9	919	October 13, 2022
Spatial join with dataframes General Usage dataframes , geo	27	2083	September 6, 2024
Using DataFrames to split overlapping datetime intervals into non overlapping intervals General Usage question , dataframes , dataframesmeta	9	367	July 23, 2023

[ANN] DataFrameIntervals.jl — joins on intervals of time

Inequality-kind joins

Related topics