[ANN] FlexiJoins.jl: fresh take on joining datasets

Thank you, it works now :slight_smile:
Upgraded the MNWE above to a MWE.

The problem is that the values in both dataframes can be missing, not only in the second one.

By the way, is it possible to see those rows, that break the cardinality assumption, in the error message?

Yes, I see… You can use by_pred(f_L, !isdisjoint, f_R) where f_L and f_R create these 0-or-Inf intervals.
This isn’t as efficient as possible with a dedicated implementation, but should work already.

The ideal interface seems to be by_pred(:col_L, isequal_with_missing_wildcard, :col_R), probably with a nicer name (: This could directly generalize to isequal_with_wildcard(nothing) or isequal_with_wildcard(NaN)…
Btw, is a function such as isequal_with_missing_wildcard already defined somewhere?

Not that I know about. I myself defined a function

equal_missing(x,y) = any(ismissing, (x,y)) ? true : x==y

And used it inside the predicate.

Ideal would be to add some keyword that would modify the behaviour of all by_key to do this kind of check. Writing up by_pred for every variable with missing values that you want to join on is too cumbersome.

I agree, maybe by_key(f_L, f_R, isequal=isequal_with_wildcard)…

It should be, for convenience – just isn’t implemented.
What I typically do after getting a cardinality error in

innerjoin((;L, R), by_..., cardinality=(1, 1))

, is run

@p outerjoin((;L, R), by_..., groupby=:L) |> filter(length(_.R) != 1)

(+ same in the other direction, groupby=:R)
and explore the result.

Is it possible to do asof joins with maximum distance? For example, I want the most recent event that occurred at most 3 months ago. This is possible after the fact, but can it occur as part of the join?

For example, with the data:

d1 = DataFrame(event=[Date(2020), Date(2021)])
d2 = DataFrame(time=[Date(2019, 12), Date(2020, 6), Date(2019, 11)])

You can run

leftjoin(
    (d1, d2),
    by_pred(:event, >, :time),
    multi=(identity, closest,)
)
2Γ—2 DataFrame
 Row β”‚ event       time       
     β”‚ Date        Date       
─────┼────────────────────────
   1 β”‚ 2020-01-01  2019-12-01
   2 β”‚ 2021-01-01  2020-06-01 # this is past the 3 month mark

Or you can make sure it is within the 3 month mark but then it is not for sure only 1 observation:

leftjoin(
    (d1, d2),
    by_pred(x -> x.event - Month(3) .. x.event, βˆ‹, :time),
)
3Γ—2 DataFrame
 Row β”‚ event       time       
     β”‚ Date        Date?      
─────┼────────────────────────
   1 β”‚ 2020-01-01  2019-11-01
   2 β”‚ 2020-01-01  2019-12-01
   3 β”‚ 2021-01-01  missing  

multi=closest should be supported for > < >= <= as you noticed, and for by_distance – e.g. by_distance(:event, :time, Euclidean(), <=(3)).
But I don’t think there’s anything out of the box for one-sided closest with a restriction on the max distance… It’s not more difficult, just not clear how the user-facing interface should look. For by_pred(x -> x.event - Month(3) .. x.event, βˆ‹, :time), it’s not clear what β€œclosest” would mean.

If you have suggestions for a natural interface, please share :slight_smile:
But also just filtering after join should work and be basically as efficient.

1 Like

My thought on an interface would be something like:

leftjoin(
    (d1, d2),
    by_pred(:event, >, :time),
    multi=(identity, closest(:event, :time))
)

I actually tried that when I was first looking at the package since it matches the by_key(...) interface. As an additional piece, it would be interesting if it accepted custom functions (closest(:event, fun, :time)). I think this could also allow items that are not already in the conditions to matter for picking the closest (e.g., find the closest time to an event without a restriction on how far away that time is).

An alternative would be moving the closest out of a keyword argument from the parent function and into the by_pred or by_distance:

leftjoin(
    (d1, d2),
    by_pred(:event, >, :time; closest=true)
    # or closest(by_pred(:event, >, :time))
)

To me, this clarifies what closest is being run on for the user.

A final alternative would be to combine those two:

leftjoin(
    (d1, d2),
    by_pred(:event, >, :time) & closest(:event, :time)
)

I am not sure about the feasibility of those options, but all of those to me fit in what is already done within the package.

Thanks for the suggestions! I agree that this functionality should be added as it’s clearly useful.

The main issue with putting closest into the condition itself, no matter how exactly it’s done, is that it doesn’t specify which side the β€œclosest” correspond to. Currently,

innerjoin((;A, B), ..., multi=(;B=closest))

means β€œfor each a in A, take the single closest b from B”. The A/B distinction is important as β€œfor each b in B, take the single closest a from A” would give a different result.

I guess I did not understand that from the docs.

So as a keyword argument, I think you could still do:

leftjoin(
    (; A=d1, B=d2),
    by_pred(:event, >, :time),
    multi=(; B=closest(:event, :time))
)

Or by_pred and by_distance could accept keyword arguments and do something like:

leftjoin(
    (; A=d1, B=d2),
    by_pred(:event, >, :time; mutli=(;B=closest))
)

This would be the most consistent with the current code.

Another thought would be to borrow the left/right terminology, (e.g., leftclosest and rightclosest or left_closest), so something like:

leftjoin(
    (d1, d2),
    by_pred(:event, >, :time) & leftclosest(:event, :time)
)

Regarding keyword vs positional arg – the idea is that one can use names anywhere, if desired. Both named and unnamed should be supported everywhere if datasets are passed as named args (leftjoin((; A, B), ...)):

  • by_key(:key_in_A, :key_in_B) or by_key(A=:key_in_A, B=:key_in_B)
  • multi=(identity, closest) or multi=(A=identity, B=closest) or multi=(;B=closest)

I think I like the clarity of putting multi into the condition itself, as in by_pred(..., multi=(;B=closest)) or by_key(..., multi=(;B=first)) etc. I’ll try seeing how well it fits with other pieces…
In particular, probably there should be a check that only one multi argument is provided in case of several conditions ... & ... & ....

1 Like