[ANN] FlexiJoins.jl: fresh take on joining datasets

FlexiJoins.jl is a fresh take on joining tabular or non-tabular datasets in Julia: Alexander Plavin / FlexiJoins.jl · GitLab.

From simple joins by key, to asof joins, to merging catalogs of terrestrial or celestial coordinates – FlexiJoins supports any usecase. The package is registered in General.

I’m not aware of any other similarly general implementation, neither in Julia nor in Python. At the same time, it’s only 366 lines of code!

Defining features that make the package flexible:

  • Wide range of join conditions: by key (so-called equi-join), by distance, by predicate, the closest match (asof join)
  • All kinds of joins, as in inner/left/right/outer
  • Results can either be a flat list, or grouped by the left/right side
  • Various dataset types transparently supported (not all Tables work, though)

With all these features, FlexiJoins is designed to be easy-to-use and fast:

  • Uniform interface to all functionaly
  • Performance close to other, less general, solutions: see benchmarks
  • Extensible in terms of both new join conditions and more specialized algorithms

Usage examples showcasing main features:

innerjoin((objects, measurements), by_key(:name))

leftjoin((O=objects, M=measurements), by_key(x -> x.name); groupby=:O)

innerjoin((M1=measurements, M2=measurements), by_key(:name) & by_distance(:time, Euclidean(), <=(3)))

innerjoin(
	(O=objects, M=measurements),
	by_key(:name) & by_pred(:ref_time, <, :time);
	multi=(M=closest,)
)

Documentation with explanations and more examples is available as a Pluto notebook. Docstrings also exist, but are pretty minimal for now.

I’ve been building FlexiJoins piece by piece for some time, based on what I needed. The interface and underlying implementation has proven to be flexible and extensible enough, but comments and suggestions are welcome.

33 Likes

I like that you have joins conditioned on custom predicates.

Indeed, the predicate and distance joins were the main features I missed before, and the main motivation for FlexiJoins. Also, intuitively combining join conditions together.

A few months have passed from the original announcement of FlexiJoins. Lots of new features and improvements were introduced in the meantime, below is a brief overview. See also the notebook with examples and explanations.

More join conditions

Of course – this is the defining feature of FlexiJoins! (:
Specifically:

  • The a ∈ b predicate now supports collections, not only intervals.
    For example, use by_pred(:name, ∈, :names) when the names field in the right dataset contains multiple names, one of which should match the name field in the left.
  • More predicates with intervals. In addition to and , now FlexiJoins also supports:
    • Inclusion: ⊆, ⊊ and ⊋, ⊇
    • Overlap: !isdisjoint
  • The not_same() predicate, useful when joining a dataset to itself.
    It’ll return pairs (1, 2) and (2, 1), dropping (1, 1) and (2, 2), if all of them match. To keep only (1, 2), use not_same(order_matters=false).

DataFrames support

Now FlexiJoins can join and return DataFrames, further expanding the wide range of supported tables/collections.
All other collections work as-is without extra work from my side. DataFrames have a very different interface, though, so they get automatically converted to/from StructArrays. This conversion shouldn’t involve copying the full data because both table types are column-oriented.
The DataFrames support can be a bit rough, because I don’t encounter them myself and not familiar with typical expectations of their users.

Other

  • Conveniently join multiple datasets one by one
  • Assert the expected join cardinality: for example, pass cardinality=(1, 1) for a 1-to-1 matching.
  • Perform lots of similar joins with the same dataset? There’s now join_cache() to reuse preprocessing and not repeat it each time. Not documented yet, see tests or ask here.
  • Minor fixes (nothing major found) and even more extensive tests than before.

See the example notebook for how to use these and other new features.

Performance

As before, all supported join conditions in FlexiJoins use optimized algorithms: they don’t involve looping over all pairs to find matches (unless explicitly specified by mode=NestedLoop).

I’ve greatly reduced allocations and improved join performance. Now the benchmark timings are very similar to SplitApplyCombine and DataFrames. Note that only simple equijoins are compared: those two packages support nothing else.

6 Likes

Currently, I cannot register the advertised new versions in General: Registrator throws an error. This is a recent issue, according to @aviks on Slack.
The most recent FlexiJoins is always available for installation from the repo url, and I’ll register when it becomes possible again.

Finally, the new version of FlexiJoins got into General despite a surprisingly long delay. All the features advertised above are now available in the registered version!

I’ve been using FlexiJoins quite actively myself, and consider its general interface and functionality close to complete. There are always potential performance improvements, of course.

The only new feature I sometimes wish to have, and may implement soon, is joining by array/dictionary indices instead of values. But that’s mostly it as for how my plans go.

If you encounter some useful join conditions that are missed, please let me know. Same with other features, if something holds FlexiJoins from joining all your datasets in Julia (:

6 Likes

This looks really cool! Definitely the most flexible interface for joining tables that I’ve seen.

One thing I couldn’t figure out after skimming over the docs: how do I do a simple asof join between two DataFrames on a single column? I can see the example that uses the multi argument, but it seems to require that the two tables be passed as a NamedTuple rather than a Tuple, but NamedTuples don’t see to be supported for DataFrames.

Attempts at asof join on t:

julia> t1 = DataFrame(t=[1.1, 1.3, 1.3, 1.8, 1.9], a=rand(5))
5×2 DataFrame
 Row │ t        a
     │ Float64  Float64
─────┼───────────────────
   1 │     1.1  0.346858
   2 │     1.3  0.557511
   3 │     1.3  0.964069
   4 │     1.8  0.143521
   5 │     1.9  0.170702

julia> t2 = DataFrame(t=[1.11, 1.2, 1.9, 3.4], b=rand(4))
4×2 DataFrame
 Row │ t        b
     │ Float64  Float64
─────┼────────────────────
   1 │    1.11  0.0102121
   2 │    1.2   0.299485
   3 │    1.9   0.791033
   4 │    3.4   0.468418

julia> innerjoin((t1, t2), by_pred(:t, >=, :t))
9×4 DataFrame
 Row │ t        a         t_1      b
     │ Float64  Float64   Float64  Float64
─────┼───────────────────────────────────────
   1 │     1.3  0.557511     1.11  0.0102121
   2 │     1.3  0.964069     1.11  0.0102121
   3 │     1.8  0.143521     1.11  0.0102121
   4 │     1.9  0.170702     1.11  0.0102121
   5 │     1.3  0.557511     1.2   0.299485
   6 │     1.3  0.964069     1.2   0.299485
   7 │     1.8  0.143521     1.2   0.299485
   8 │     1.9  0.170702     1.2   0.299485
   9 │     1.9  0.170702     1.9   0.791033

julia> innerjoin((t1, t2), by_pred(:t, >=, :t), multi=(closest,))
ERROR: AssertionError: length(x) == length(datas)
Stacktrace:
  [1] normalize_arg(x::Tuple{FlexiJoins.Closest}, datas::Tuple{StructVector{NamedTuple{(:t, :a), Tuple{Float64, Float64}}, NamedTuple{(:t, :a), Tuple{Vector{Float64}, Vector{Float64}}}, Int64}, StructVector{NamedTuple{(:t, :b), Tuple{Float64, Float64}}, NamedTuple{(:t, :b), Tuple{Vector{Float64}, Vector{Float64}}}, Int64}}; default::Function)
    @ FlexiJoins /me/.julia/packages/FlexiJoins/GVxX3/src/normalize_specs.jl:17
....
julia> innerjoin((T1=t1, T2=t2), by_pred(:t, >=, :t), multi=(T2=closest,))
ERROR: MethodError: no method matching keytype(::DataFrame)

Closest candidates are:
  keytype(::DataStructures.SDMExcludeLast{T}) where T<:Union{DataStructures.SortedDict, DataStructures.SortedMultiDict, DataStructures.SortedSet}
   @ DataStructures /me/.julia/packages/DataStructures/59MD0/src/container_loops.jl:28
  keytype(::DataStructures.SortedDict{K, D, Ord}) where {K, D, Ord<:Base.Order.Ordering}
   @ DataStructures /me/.julia/packages/DataStructures/59MD0/src/sorted_dict.jl:310
  keytype(::DataStructures.SDMIncludeLast{T}) where T<:Union{DataStructures.SortedDict, DataStructures.SortedMultiDict, DataStructures.SortedSet}
   @ DataStructures /me/.julia/packages/DataStructures/59MD0/src/container_loops.jl:63
  ...

Stacktrace:
  [1] (::FlexiJoins.var"#98#99")(X::DataFrame, nms::FlexiJoins.Drop)
    @ FlexiJoins /me/.julia/packages/FlexiJoins/GVxX3/src/ix_compute.jl:61
....

The short answer is that you can specify multi=(closest, identity) or multi=(identity, closest) — depending on which side asof you need.
With named tuples, it makes sense to pass multi=(R=closest,): it’s clear which side is selected. With regular tuples, multi=(closest,) is ambiguous, and the full form has to be used.

Some more context: FlexiJoins uses the same generic code for all collection types, except for dataframes that have a very different interface. So, there is a (small and I hope efficient) conversion step for them: src/FlexiJoins.jl · master · Alexander Plavin / FlexiJoins.jl · GitLab.
For now, this conversion is only performed for a tuple of dataframes: it’s basically obvious what the return value should be. With a named tuple, I can see several variants of naming the result columns: should the provided left/right names be appended to the original column names, or not?

As I don’t use DataFrames myself, it’s hard for me to tell which variant would be the best and the most expected/intuitive for dataframe users. Opinions and suggestions are welcome! I mostly join arrays, and this question doesn’t come up there.

Thanks! That makes sense and seems like a reasonable interface for DataFrames.

As a benchmark I compared FlexiJoins to kdb+'s asof join. They’ve highly optimized for sorted time series and are not nearly as flexible, so to at least see the unsorted performance fairly close is impressive.

For joining 10^4 rows to 10^7 rows, I get 2.6x slower for unsorted and 18.2x slower for sorted:

julia> n1 = 10^4; n2 = 10^7; t1 = DataFrame(t=rand(n1), a=rand(n1)); t2 = DataFrame(t=rand(n2), b=rand(n2)); t3 = sort(t2, :t);

julia> @btime leftjoin(($t1, $t2), by_pred(:t, >=, :t), multi=(identity,closest));
  2.678 s (10182 allocations: 77.90 MiB)

julia> @btime leftjoin(($t1, $t3), by_pred(:t, >=, :t), multi=(identity,closest));
  326.964 ms (10182 allocations: 77.90 MiB)

kdb+ timings are in milliseconds:

q)n1:prd 4#10; n2:prd 7#10; t1:([]t:n1?1f; a:n1?1f); t2:([]t:n2?1f; b:n2?1f); t3:`t xasc t2
q)\t aj[`t;t1;`t xasc t2]
1031
q)\t aj[`t;t1;t3]
18
1 Like

Thanks for the benckmarks!
Indeed, FlexiJoins spend the majority of time doing sorting in both of these examples. Sorting is performed even in the already sorted case, it just is much faster. Potentially, it’s possible to use something like AcceleratedArrays.jl to record the information that an array is already sorted. Not that easy to think of a reasonable interface for that, though.

If you have a single dataset that you join many others to, FlexiJoins can reuse the preparation (sorting) stage results. It’s not really documented yet, but works – and I use it in monte-carlo simulations where only one dataset changes thousands of times.

Update v0.1.28

A brief announcement of FlexiJoins updates since the last published version. The changes are relatively minor but can be important in certain cases.

  • Cleaned up dependencies, most notably removing Static.jl that caused version conflicts with other packages.
  • Added (isapprox) join predicate with atol. It’s just a more convenient alternative to interval inclusion predicates.
  • DataFrames support: now accept AbstractDataFrames as well, as suggested by @bkamins.
  • Some performance optimizations.

Also:

  • New benchmarks prompted by discussion with @bkamins (Performance vs DataFrames.jl (#3) · Issues · Alexander Plavin / FlexiJoins.jl · GitLab) indicate that FlexiJoins aren’t always as performant as DataFrames native joins. The latter handle quite a lot of “special cases” explicitly in a more efficient way. I see no fundamental issues with implementing similar optimizations within the FlexiJoins framework and interface, but these aren’t in my plans for the near future.
  • Published two new companion packages with a similar design approach: group-by for a wide range of datasets FlexiGroups, and extensions of the map function convenient for data manipulation FlexiMaps
5 Likes

When I join the datasets using DataFrames, I use the conditions matchmissing=true and validate=(true,true).
How do I reproduce this behaviour using FlexiJoins?

Do I understand correctly, that in order to join on multiple keys, I need to chain multiple by_something commands via &?

What exactly does this option do? I’m not familiar with DataFrames, and their docs don’t seem to list true as an option for matchmissing.

FlexiJoins.jl generally has the isequal(key_left, key_right) semantics for key-equality joining, although I think it’s not comprehensively tested with missings. I myself use nothing almost always instead, same as many Julia functions do.
Anyway, either missings or nothings – they should match to the same value in the other table. Please report issues if it’s not the case!

As I understand from DF.jl docs, this basically does allunique() check on the keys?

FlexiJoins.jl has the cardinality argument (see the docs), you can set it as cardinality=(1, 1) to ensure that you are getting one-to-one join. Other common value for cardinality include 0:1 for optional matches, and + for at least one match.

In the general case yes: see the “Join conditions” section in the docs.
But for multiple keys specifically, the most straightforward is to use by_key(x -> (x.key_1, x.key_2)) instead of by_key(x -> x.key_1).

1 Like

I made a mistake, I am sorry. What I really meant was the option matchmissing=:equal. What it really does is to make missing values to be considered equal to other values. I.e., if some of the data is missing but other conditions are satisfied, the two rows are considered matched.

To these regards, how does one writes a join condition, that takes the values from the two dataframes and compares them?

1 Like

Oh, interesting… I haven’t seen/noticed/needed such a mode before, and it’s not directly imlemented in FlexiJoins.jl. Potentially in scope, but I don’t have short-term plans for that unless someone makes a PR.

Meanwhile, a straightforward solution for numeric values would be to create infinite intervals for missings in this case, like

using FlexiJoins, IntervalSets

innerjoin((L, R), by_pred(:x, ∈, r->ismissing(r.x) ? -Inf..Inf : r.x±0.))

by_pred(left_func, condition, right_func), eg by_pred(l->l.x, <, r->abs(r.y)).
Efficient implementation is available for many useful conditions, but not for all binary functions of course.

If you try a function without efficient join implementation, it will error:

julia> innerjoin((1:10, 1:10), by_pred(identity,!=,identity))
ERROR: No known mode supported by by_pred(identity != identity)
Pass mode=FlexiJoins.Mode.NestedLoop() for nested loop join that always works but performs O(n*m) comparisons.

So, FlexiJoins.jl gives an option to opt into quadratic-complexity joins, but it doesn’t perform them without that opt-in.
This approach should work with arbitrary binary functions.

Any distances work, either on numbers or on vectors – and they work the same way!
You just need to specify where these vectors are in your data:

# the most common case - vector is a column, here called "xy"
by_distance(:xy, Euclidean(), <=(3))

# vector components are two columns, x and y (but consider combining them together into a single vector upstream for your general convenience :) )
by_distance(r -> SVector(r.x, r.y), Euclidean(), <=(3))

# vector length and direction are in two columns:
by_distance(r -> r.len * SVector(sincos(r.angle)...), Euclidean(), <=(3))
1 Like

The DataFrames behavior of erroring on missing key-values was proposed in Explicitly handling missingness in join columns · Issue #2499 · JuliaData/DataFrames.jl · GitHub.

Thank you.
Using the suggested anonymous function r -> [r.x, r.y], an error is produced.
Fixed MWE below with your new advice to use StaticArrays: r -> SVector(r.x, r.y). Now it works.

MWE
using FlexiJoins, Distances, Tables, StaticArrays

S = NamedTuple{(:ID1,:x,:y)}((1:4, [1.1, 2.2, 2.95, 4.3], [1.1, 2.2, 2.95, 4.3]))
R = NamedTuple{(:ID2,:x,:y)}((1000.:100:2000, 0:0.5:5, 0:0.5:5))
S = merge(S, (rho = hypot.(S.x, S.y),))
R = merge(R, (rho = hypot.(R.x, R.y),))

TS = Tables.rowtable(S)
TR = Tables.rowtable(R)

# match by distance tolerance:
r = innerjoin((O=TR, M=TS), by_distance(r -> SVector(r.x, r.y), Euclidean(), <=(0.8)); multi=(M=closest,))

Oh, sorry, didn’t test it! Looks like static vectors are required for now, updated my previous post.

For distance joins, FlexiJoins.jl uses NearestNeighbors.jl, but they only support static vectors. Otherwise, both regular vectors and arbitrary objects would’ve worked (Join by distance on strings · Issue #8 · JuliaAPlavin/FlexiJoins.jl · GitHub, 1.0 road map · Issue #180 · KristofferC/NearestNeighbors.jl · GitHub).

Note that you can just use innerjoin() instead of leftjoin() + nonmatches=drop (:

1 Like