A few months have passed since the original announcement of FlexiJoins. Many new features and improvements have been introduced in the meantime; below is a brief overview. See also the notebook with examples and explanations.
More join conditions
Of course – this is the defining feature of FlexiJoins! (:
Specifically:
- The `a ∈ b` predicate now supports collections, not only intervals.
  For example, use `by_pred(:name, ∈, :names)` when the `names` field in the right dataset contains multiple names, one of which should match the `name` field in the left.
- More predicates with intervals. In addition to `∈` and `∋`, `FlexiJoins` now also supports:
  - Inclusion: `⊆`, `⊊` and `⊋`, `⊇`
  - Overlap: `!isdisjoint`
- The `not_same()` predicate, useful when joining a dataset to itself.
  It’ll return pairs `(1, 2)` and `(2, 1)`, dropping `(1, 1)` and `(2, 2)`, if all of them match. To keep only `(1, 2)`, use `not_same(order_matters=false)`.
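To make the collection predicate and `not_same()` more concrete, here is a minimal sketch. The toy datasets and the field names `team`, `group` and `id` are made up for illustration. The sketch assumes that `not_same()` is combined with another condition via `&`, and that passing the datasets as a named tuple gives a result with matching field names; check the example notebook for the authoritative usage.

```julia
using FlexiJoins

# Hypothetical toy data, not taken from the announcement
people = [(name="Alice",), (name="Bob",), (name="Carol",)]
teams  = [(team="red",  names=["Alice", "Bob"]),
          (team="blue", names=["Carol"])]

# Match each person to the team whose `names` collection contains their name
J = innerjoin((P=people, T=teams), by_pred(:name, ∈, :names))

# Self-join: pair up distinct items that share the same `group`
items = [(group=1, id="a"), (group=1, id="b"), (group=2, id="c")]

# Both orderings of each pair are kept; an item is never paired with itself
both_orders = innerjoin((A=items, B=items), by_key(:group) & not_same())

# Each unordered pair is kept only once
one_order = innerjoin((A=items, B=items),
                      by_key(:group) & not_same(order_matters=false))
```

Each element of `J` should hold one person as `P` and the matching team as `T`. The two self-joins differ only in whether the swapped pair is also returned.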
DataFrames support
Now FlexiJoins can join and return DataFrames, further expanding the wide range of supported tables/collections.
All other collections work as-is without extra work from my side. DataFrames have a very different interface, though, so they get automatically converted to/from StructArrays. This conversion shouldn’t involve copying the full data because both table types are column-oriented.
DataFrames support may be a bit rough, because I don’t use them myself and am not familiar with the typical expectations of their users.
Other
- Conveniently join multiple datasets one by one
- Assert the expected join cardinality: for example, pass `cardinality=(1, 1)` for a 1-to-1 matching.
- Perform lots of similar joins with the same dataset? There’s now `join_cache()` to reuse preprocessing and not repeat it each time. Not documented yet, see tests or ask here.
- Minor fixes (nothing major found) and even more extensive tests than before.
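As an illustration of the cardinality assertion, here is a small sketch. The `users` and `emails` datasets are hypothetical, and the sketch assumes `cardinality` is passed as a keyword argument to the join function alongside the join condition.

```julia
using FlexiJoins

# Hypothetical toy data: every user has exactly one email record
users  = [(id=1, name="Alice"), (id=2, name="Bob")]
emails = [(id=2, email="bob@example.org"), (id=1, email="alice@example.org")]

# Join by `id` and assert that the matching is 1-to-1
J = innerjoin((U=users, E=emails), by_key(:id); cardinality=(1, 1))
```

If one of the datasets contained a duplicate `id`, the matching would no longer be 1-to-1 and the join should raise an error instead of silently returning extra rows.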
See the example notebook for how to use these and other new features.
Performance
As before, all supported join conditions in FlexiJoins use optimized algorithms: they don’t involve looping over all pairs to find matches (unless explicitly specified by mode=NestedLoop).
I’ve greatly reduced allocations and improved join performance. Now the benchmark timings are very similar to SplitApplyCombine and DataFrames. Note that only simple equijoins are compared: those two packages support nothing else.