[ANN] FlexiJoins.jl: fresh take on joining datasets

A few months have passed since the original announcement of FlexiJoins. Lots of new features and improvements have been introduced in the meantime; below is a brief overview. See also the notebook with examples and explanations.

More join conditions

Of course – this is the defining feature of FlexiJoins! (:
Specifically:

  • The a ∈ b predicate now supports collections, not only intervals.
    For example, use by_pred(:name, ∈, :names) when the names field in the right dataset contains multiple names, one of which should match the name field in the left.
  • More predicates with intervals. In addition to the membership predicates, FlexiJoins now also supports:
    • Inclusion: ⊆, ⊊ and ⊋, ⊇
    • Overlap: !isdisjoint
  • The not_same() predicate, useful when joining a dataset to itself.
    If the pairs (1, 1), (1, 2), (2, 1) and (2, 2) all satisfy the other conditions, it returns (1, 2) and (2, 1) and drops (1, 1) and (2, 2). To keep only (1, 2), use not_same(order_matters=false).
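
To make the semantics of these predicates concrete, here is a minimal sketch in plain Julia. It does not call FlexiJoins at all: the sample data and variable names are made up for illustration, integer ranges stand in for intervals, and the brute-force comprehensions only show which pairs are expected to match, not how the package finds them.

```julia
# Plain-Julia illustration of the predicate semantics; FlexiJoins itself is not called here.

# Membership in a collection: the left name must be one of the right names.
left  = [(name = "alice",), (name = "bob",), (name = "dave",)]
right = [(names = ["alice", "al"],), (names = ["bob", "carol"],)]
member_pairs = [(i, j) for i in eachindex(left) for j in eachindex(right) if left[i].name ∈ right[j].names]

# Inclusion and overlap, using integer ranges as stand-ins for intervals.
a, b, c = 2:4, 1:5, 4:8
included    = a ⊆ b              # every element of a lies in b
overlapping = !isdisjoint(a, c)  # a and c share the element 4

# Self-join filtering over two records that match each other and themselves.
ids = [1, 2]
ordered_pairs   = [(i, j) for i in ids for j in ids if i != j]  # analogue of not_same()
unordered_pairs = [(i, j) for i in ids for j in ids if i < j]   # analogue of not_same(order_matters=false)
```

The last two lines mirror the description above: the first keeps both (1, 2) and (2, 1), the second keeps only (1, 2).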

DataFrames support

Now FlexiJoins can join and return DataFrames, further expanding the wide range of supported tables/collections.
All other collections work as-is without extra work on my side. DataFrames have a very different interface, though, so they get automatically converted to/from StructArrays. This conversion shouldn’t involve copying the full data because both table types are column-oriented.
The DataFrames support can be a bit rough, because I don’t use them myself and am not familiar with the typical expectations of their users.

Other

  • Conveniently join multiple datasets one by one
  • Assert the expected join cardinality: for example, pass cardinality=(1, 1) for a 1-to-1 matching.
  • Perform lots of similar joins with the same dataset? There’s now join_cache() to reuse the preprocessing instead of repeating it each time. It’s not documented yet; see the tests or ask here.
  • Minor fixes (nothing major found) and even more extensive tests than before.
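
As a rough picture of what a 1-to-1 cardinality assertion verifies, the sketch below checks a list of matched index pairs by hand. The helper is_one_to_one is hypothetical and not part of FlexiJoins, and it assumes that "1-to-1" means every record on both sides is matched exactly once; the package's own check may be defined or implemented differently.

```julia
# Hypothetical helper, not part of FlexiJoins: inspects matched (left, right) index pairs.
function is_one_to_one(matches, nleft, nright)
    left_counts = zeros(Int, nleft)
    right_counts = zeros(Int, nright)
    for (i, j) in matches
        left_counts[i] += 1
        right_counts[j] += 1
    end
    # Assumption: "1-to-1" means every record on both sides is matched exactly once.
    return all(==(1), left_counts) && all(==(1), right_counts)
end

ok  = is_one_to_one([(1, 2), (2, 1)], 2, 2)  # each record appears in exactly one pair
bad = is_one_to_one([(1, 1), (1, 2)], 2, 2)  # left record 1 is matched twice
```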

See the example notebook for how to use these and other new features.

Performance

As before, all supported join conditions in FlexiJoins use optimized algorithms: they don’t involve looping over all pairs to find matches (unless explicitly specified by mode=NestedLoop).
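
For intuition about why avoiding the all-pairs loop matters, here is a generic plain-Julia sketch for the simplest case, an equality join. It is not taken from FlexiJoins internals, and the function names are made up for illustration: one version compares every pair, the other builds a Dict over the right side once and then looks up each left key.

```julia
# Generic illustration, not FlexiJoins internals: two ways to find equal-key index pairs.

# Baseline: compare every left key with every right key (length(lkeys) * length(rkeys) comparisons).
function nested_loop_pairs(lkeys, rkeys)
    out = Tuple{Int,Int}[]
    for (i, l) in enumerate(lkeys), (j, r) in enumerate(rkeys)
        l == r && push!(out, (i, j))
    end
    return out
end

# Hash-based: index the right keys once, then do one lookup per left key.
function hash_pairs(lkeys, rkeys)
    index = Dict{eltype(rkeys),Vector{Int}}()
    for (j, r) in enumerate(rkeys)
        push!(get!(() -> Int[], index, r), j)
    end
    out = Tuple{Int,Int}[]
    for (i, l) in enumerate(lkeys)
        for j in get(index, l, Int[])
            push!(out, (i, j))
        end
    end
    return out
end

lkeys = ["a", "b", "c", "b"]
rkeys = ["b", "d", "a", "b"]
same_result = nested_loop_pairs(lkeys, rkeys) == hash_pairs(lkeys, rkeys)
```

Both functions return the same pairs, but the second one does work roughly proportional to the input sizes plus the number of matches rather than to their product. Other conditions, such as interval predicates, need different techniques, which this sketch does not cover.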

I’ve greatly reduced allocations and improved join performance. The benchmark timings are now very similar to those of SplitApplyCombine and DataFrames. Note that the comparison covers only simple equijoins, since those two packages support nothing else.
