A few months have passed since the original announcement of FlexiJoins. Lots of new features and improvements have been introduced in the meantime; below is a brief overview. See also the notebook with examples and explanations.
More join conditions
Of course – this is the defining feature of FlexiJoins! (:
Specifically:
- The a ∈ b predicate now supports collections, not only intervals.
  For example, use by_pred(:name, ∈, :names) when the names field in the right dataset contains multiple names, one of which should match the name field in the left.
- More predicates with intervals. In addition to ∈ and ∋, FlexiJoins now also supports:
  - Inclusion: ⊆, ⊊ and ⊋, ⊇
  - Overlap: !isdisjoint
- The not_same() predicate, useful when joining a dataset to itself.
  It’ll return pairs (1, 2) and (2, 1), dropping (1, 1) and (2, 2), if all of them match. To keep only (1, 2), use not_same(order_matters=false).
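To make the semantics of these predicates concrete, here is a small plain-Julia sketch that spells out which pairs the collection ∈ predicate and the two not_same() variants select. It deliberately uses naive all-pairs comprehensions rather than FlexiJoins itself, so it only mirrors the results described above, not how the package computes them. The people and teams data are made up for illustration.

```julia
# Hypothetical toy data: `people` is the left dataset, `teams` the right one.
people = [(name="Alice",), (name="Bob",), (name="Carol",)]
teams = [(team="red", names=["Alice", "Carol"]), (team="blue", names=["Bob"])]

# Pairs that a join on by_pred(:name, ∈, :names) is described as matching:
# the left `name` must be one of the right `names`.
member_pairs = [(p.name, t.team) for p in people for t in teams if p.name ∈ t.names]

# Self-join over indices 1:2, assuming every pair satisfies the other conditions.
all_pairs = [(i, j) for i in 1:2 for j in 1:2]

# Like not_same(): drop an element paired with itself, keep both orders.
ordered_pairs = [(i, j) for (i, j) in all_pairs if i != j]

# Like not_same(order_matters=false): keep each unordered pair only once.
unordered_pairs = [(i, j) for (i, j) in all_pairs if i < j]

println(member_pairs)
println(ordered_pairs)
println(unordered_pairs)
```

The two self-join filters differ only in `i != j` versus `i < j`, which is the whole distinction between the ordered and unordered variants.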
DataFrames support
Now FlexiJoins can join and return DataFrames, further expanding the wide range of supported tables/collections.
All other collections work as-is, without extra work on my side. DataFrames have a very different interface, though, so they get automatically converted to/from StructArrays. This conversion shouldn’t involve copying the full data because both table types are column-oriented.
The DataFrames support can be a bit rough, because I don’t use them myself and am not familiar with the typical expectations of their users.
Other
- Conveniently join multiple datasets one by one
- Assert the expected join cardinality: for example, pass cardinality=(1, 1) for a 1-to-1 matching.
- Perform lots of similar joins with the same dataset? There’s now join_cache() to reuse preprocessing and not repeat it each time. Not documented yet, see tests or ask here.
- Minor fixes (nothing major found) and even more extensive tests than before.
See the example notebook for how to use these and other new features.
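As a rough illustration of what asserting a 1-to-1 cardinality means, the plain-Julia sketch below counts matches per key on each side of an equijoin and fails if any count differs from one. The helper name check_one_to_one and the toy keys are hypothetical. This naive check only mirrors the idea and says nothing about how FlexiJoins implements the cardinality argument or reports violations.

```julia
# Hypothetical helper: assert that every key on each side has exactly one
# match on the other side. A naive count over all elements, for illustration only.
function check_one_to_one(left_keys, right_keys)
    for l in left_keys
        n = count(==(l), right_keys)
        n == 1 || error("left key $l has $n matches, expected 1")
    end
    for r in right_keys
        n = count(==(r), left_keys)
        n == 1 || error("right key $r has $n matches, expected 1")
    end
    return true
end

check_one_to_one([1, 2, 3], [3, 1, 2])      # passes: a 1-to-1 matching
# check_one_to_one([1, 2, 3], [1, 1, 2])    # would throw: key 1 matches twice
```

Failing early like this is the point of a cardinality assertion: a silently duplicated or dropped row is much harder to notice after the join.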
Performance
As before, all supported join conditions in FlexiJoins use optimized algorithms: they don’t involve looping over all pairs to find matches (unless explicitly specified by mode=NestedLoop).
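To illustrate why avoiding the all-pairs loop matters, here is a generic sketch of an equijoin done two ways: a nested loop and a hash-based lookup. This is a textbook illustration with made-up function names, not FlexiJoins’ actual implementation. Both return the same index pairs, but the hash version does work proportional to the input sizes plus the number of matches instead of comparing every pair.

```julia
# Naive equijoin: compare every left element with every right element.
function nested_loop_join(left, right)
    return [(i, j) for i in eachindex(left) for j in eachindex(right) if left[i] == right[j]]
end

# Hash-based equijoin: index the right side once, then look up each left element.
function hash_join(left, right)
    index = Dict{eltype(right), Vector{Int}}()
    for (j, r) in pairs(right)
        push!(get!(index, r, Int[]), j)   # value => positions in `right`
    end
    result = Tuple{Int, Int}[]
    for (i, l) in pairs(left)
        for j in get(index, l, Int[])     # only the matching positions
            push!(result, (i, j))
        end
    end
    return result
end

left = [1, 2, 3, 2]
right = [2, 4, 1, 2]
println(nested_loop_join(left, right) == hash_join(left, right))
```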
I’ve greatly reduced allocations and improved join performance. Now the benchmark timings are very similar to SplitApplyCombine and DataFrames. Note that only simple equijoins are compared: those two packages support nothing else.