[ANN] FlexiJoins.jl: fresh take on joining datasets

Thank you, it works now :slight_smile:
Upgraded the MNWE above to a MWE.

The problem is that the values in both dataframes can be missing, not only in the second one.

By the way, is it possible to see the rows that break the cardinality assumption in the error message?

Yes, I see… You can use by_pred(f_L, !isdisjoint, f_R) where f_L and f_R create these 0-or-Inf intervals.
This isn’t as efficient as a dedicated implementation could be, but it should already work.
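To make the interval idea concrete, here is a minimal sketch in plain Julia, without FlexiJoins or IntervalSets. The helpers to_interval and overlaps are hypothetical names made up for illustration, and the keys are assumed to be numeric. A missing key becomes an infinite interval and any other key a zero-width one, so "intervals overlap" means "equal, or at least one side is missing".

```julia
# Hypothetical helper: map a key to a closed interval (lo, hi).
# A missing key becomes the infinite interval, any other key a zero-width one.
to_interval(x) = ismissing(x) ? (-Inf, Inf) : (float(x), float(x))

# Two closed intervals overlap when neither lies entirely before the other.
overlaps(a, b) = a[1] <= b[2] && b[1] <= a[2]

left  = [1, missing, 3]
right = [1, 2, missing]

# Naive nested-loop join on the overlap predicate, returning index pairs.
matches = [(i, j) for i in eachindex(left) for j in eachindex(right)
           if overlaps(to_interval(left[i]), to_interval(right[j]))]
```

The nested loop is quadratic and only there to show the semantics; it is not how a join library would evaluate the predicate.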

The ideal interface seems to be by_pred(:col_L, isequal_with_missing_wildcard, :col_R), probably with a nicer name (: This could directly generalize to isequal_with_wildcard(nothing) or isequal_with_wildcard(NaN).
Btw, is a function such as isequal_with_missing_wildcard already defined somewhere?
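For illustration, a generalized wildcard comparison could be sketched as a function that returns a two-argument predicate. This is only an assumption about what such a helper might look like; isequal_with_wildcard is not defined in Base or, as far as this thread goes, in FlexiJoins. It uses isequal rather than == so that missing and NaN can themselves serve as wildcards.

```julia
# Hypothetical sketch: build a comparison where `wildcard` matches anything.
# isequal is used because `missing == missing` is missing and `NaN == NaN` is false.
isequal_with_wildcard(wildcard) =
    (x, y) -> isequal(x, wildcard) || isequal(y, wildcard) || isequal(x, y)

eq_missing = isequal_with_wildcard(missing)
eq_nothing = isequal_with_wildcard(nothing)
eq_nan     = isequal_with_wildcard(NaN)

eq_missing(1, missing)    # true: one side is the wildcard
eq_missing(1, 2)          # false: ordinary mismatch
eq_nothing(nothing, "a")  # true
eq_nan(NaN, 3.0)          # true
```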

Not that I know of. I defined a function myself:

equal_missing(x, y) = any(ismissing, (x, y)) || x == y

And used it inside the predicate.
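As a rough sketch of what "inside the predicate" can look like, here is a row-level check over two key columns. The column names a and b and the helper rows_match are made up for illustration, and the rows are plain NamedTuples rather than DataFrame rows. It also shows why the explicit missing check is needed: plain == propagates missing instead of returning a Bool.

```julia
# Same behaviour as the definition above.
equal_missing(x, y) = any(ismissing, (x, y)) ? true : x == y

# Hypothetical row predicate over two key columns `a` and `b`.
rows_match(l, r) = equal_missing(l.a, r.a) && equal_missing(l.b, r.b)

l = (a = 1, b = missing)
r = (a = 1, b = "x")

(l.b == r.b) === missing         # true: plain == returns missing, not a Bool
rows_match(l, r)                 # true: the missing b acts as a wildcard
rows_match(l, (a = 2, b = "x"))  # false: a differs
```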

Ideally there would be a keyword that modifies the behaviour of every by_key to do this kind of check. Writing a by_pred for every variable with missing values that you want to join on is too cumbersome.

I agree, maybe by_key(f_L, f_R, isequal=isequal_with_wildcard)

It should be possible, for convenience; it just isn’t implemented.
What I typically do after getting a cardinality error in

innerjoin((;L, R), by_..., cardinality=(1, 1))

is to run

@p outerjoin((;L, R), by_..., groupby=:L) |> filter(length(_.R) != 1)

(and the same in the other direction, with groupby=:R)
and explore the result.
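The same diagnostic idea can be sketched in plain Julia, without FlexiJoins or DataPipes, to show what the grouped result contains. The tables and the id key below are toy data invented for illustration: each left row is paired with all of its matching right rows, and only the rows whose match count is not exactly 1 are kept.

```julia
# Toy tables as vectors of NamedTuples; `id` is a hypothetical join key.
L = [(id = 1,), (id = 2,), (id = 3,)]
R = [(id = 1,), (id = 2,), (id = 2,)]

# For each left row, collect all right rows with the same id
# (the plain-Julia analogue of an outer join grouped by the left side).
grouped = [(L = l, R = [r for r in R if r.id == l.id]) for l in L]

# Keep only the left rows that break the (1, 1) cardinality assumption.
offenders = filter(g -> length(g.R) != 1, grouped)

[g.L.id for g in offenders]        # ids 2 and 3
[length(g.R) for g in offenders]   # 2 matches and 0 matches
```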