[ANN] FlexiJoins.jl: fresh take on joining datasets

aplavin · July 14, 2022, 4:49pm

A few months have passed since the original announcement of FlexiJoins. Many new features and improvements have been introduced in the meantime; below is a brief overview. See also the notebook with examples and explanations.

More join conditions

Of course – this is the defining feature of FlexiJoins! (:
Specifically:

- The a ∈ b predicate now supports collections, not only intervals.
  For example, use by_pred(:name, ∈, :names) when the names field in the right dataset contains multiple names, one of which should match the name field in the left.
- More predicates with intervals. In addition to ∈ and ∋, FlexiJoins now also supports:
  - Inclusion: ⊆, ⊊ and ⊋, ⊇
  - Overlap: !isdisjoint
- The not_same() predicate, useful when joining a dataset to itself.
  It’ll return pairs (1, 2) and (2, 1), dropping (1, 1) and (2, 2), if all of them match. To keep only (1, 2), use not_same(order_matters=false).
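
To make the semantics of the conditions above concrete, here is a minimal sketch in plain Julia (no packages). It is only a naive all-pairs reference for what gets matched, not how FlexiJoins computes it. The toy datasets `people`, `groups`, `rows` and all variable names are made up for illustration.

```julia
# Naive all-pairs reference for the semantics only; FlexiJoins avoids this loop.
# All data and names below are hypothetical.
people = [(name="A",), (name="B",), (name="C",)]
groups = [(names=["A", "B"],), (names=["B"],)]

# Meaning of by_pred(:name, ∈, :names): the left name is one of the right names.
in_pairs = [(i, j) for i in eachindex(people) for j in eachindex(groups) if people[i].name ∈ groups[j].names]

# Self-join in which every pair of rows matches on the key k.
rows = [(k=10,), (k=10,)]
all_pairs = [(i, j) for i in eachindex(rows) for j in eachindex(rows) if rows[i].k == rows[j].k]

# Meaning of not_same(): drop pairs of a row with itself.
distinct_pairs = filter(p -> p[1] != p[2], all_pairs)

# Meaning of not_same(order_matters=false): keep each unordered pair once.
unordered_pairs = filter(p -> p[1] < p[2], all_pairs)
```

Here `in_pairs` is `[(1, 1), (2, 1), (2, 2)]`, `distinct_pairs` is `[(1, 2), (2, 1)]`, and `unordered_pairs` is `[(1, 2)]`.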

`DataFrames` support

Now FlexiJoins can join and return DataFrames, further expanding the wide range of supported tables/collections.
All other collections work as-is without extra work from my side. DataFrames have a very different interface, though, so they get automatically converted to/from StructArrays. This conversion shouldn’t involve copying the full data because both table types are column-oriented.
The DataFrames support may be a bit rough, because I don’t use them myself and am not familiar with the typical expectations of their users.

Other

- Conveniently join multiple datasets one by one.
- Assert the expected join cardinality: for example, pass cardinality=(1, 1) for a 1-to-1 matching.
- Perform lots of similar joins with the same dataset? There’s now join_cache() to reuse preprocessing and not repeat it each time. Not documented yet, see tests or ask here.
- Minor fixes (nothing major found) and even more extensive tests than before.
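
As a rough illustration of what a 1-to-1 cardinality assertion guards against, here is a sketch in plain Julia. It assumes that cardinality=(1, 1) means every row on each side matches exactly one row on the other side. `is_one_to_one` is a hypothetical helper written for this example, not a FlexiJoins function.

```julia
# Hypothetical helper, not part of FlexiJoins: returns true when the index
# pairs form a 1-to-1 matching between nleft left rows and nright right rows.
# Assumption: "1-to-1" means each row on both sides is matched exactly once.
function is_one_to_one(matches, nleft, nright)
    lcount = zeros(Int, nleft)   # number of matches per left row
    rcount = zeros(Int, nright)  # number of matches per right row
    for (i, j) in matches
        lcount[i] += 1
        rcount[j] += 1
    end
    return all(==(1), lcount) && all(==(1), rcount)
end

ok  = is_one_to_one([(1, 2), (2, 1)], 2, 2)  # true: every row matched once
bad = is_one_to_one([(1, 1), (1, 2)], 2, 2)  # false: left row 1 matched twice
```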

See the example notebook for how to use these and other new features.

Performance

As before, all supported join conditions in FlexiJoins use optimized algorithms: they don’t involve looping over all pairs to find matches (unless explicitly specified by mode=NestedLoop).
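
For readers wondering how a join can avoid looping over all pairs, here is a generic hash-based equijoin sketch in plain Julia. This is a textbook technique shown for illustration only, not FlexiJoins' actual implementation; `hash_equijoin` and the toy data are made up.

```julia
# Generic hash-join sketch; NOT FlexiJoins' actual implementation.
# `hash_equijoin` is a hypothetical name used only in this example.
function hash_equijoin(L, R, key)
    # Index the right side once: key value => indices of rows with that key.
    index = Dict{Any, Vector{Int}}()
    for (j, r) in pairs(R)
        push!(get!(Vector{Int}, index, key(r)), j)
    end
    # Each left row looks up its matches directly instead of scanning all of R.
    return [(i, j) for (i, l) in pairs(L) for j in get(index, key(l), Int[])]
end

L = [(k=1,), (k=2,), (k=1,)]
R = [(k=1,), (k=3,), (k=1,)]
matched = hash_equijoin(L, R, row -> row.k)
```

Here `matched` is `[(1, 1), (1, 3), (3, 1), (3, 3)]`. Building the index takes one pass over the right side; after that, the work is proportional to the left side plus the number of matches rather than to the number of all pairs.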

I’ve greatly reduced allocations and improved join performance. Benchmark timings are now very similar to those of SplitApplyCombine and DataFrames. Note that only simple equijoins are compared, since those two packages support nothing else.

