[ANN] FlexiJoins.jl: fresh take on joining datasets

rafael.guerra · July 8, 2024, 11:04am

Thank you, it works now
Upgraded the MNWE above to a MWE.

Gregstrq · July 8, 2024, 1:07pm

The problem is that the values in both dataframes can be missing, not only in the second one.

Gregstrq · July 8, 2024, 3:34pm

By the way, is it possible to see those rows, that break the cardinality assumption, in the error message?

aplavin · July 8, 2024, 3:48pm

Yes, I see… You can use by_pred(f_L, !isdisjoint, f_R) where f_L and f_R create these 0-or-Inf intervals.
This isn’t as efficient as possible with a dedicated implementation, but should work already.

The ideal interface seems to be by_pred(:col_L, isequal_with_missing_wildcard, :col_R), probably with a nicer name (: This could directly generalize to isequal_with_wildcard(nothing) or isequal_with_wildcard(NaN)…
Btw, is a function such as isequal_with_missing_wildcard already defined somewhere?

Gregstrq · July 8, 2024, 3:57pm

Not that I know about. I myself defined a function

equal_missing(x,y) = any(ismissing, (x,y)) ? true : x==y

And used it inside the predicate.

Ideal would be to add some keyword that would modify the behaviour of all by_key to do this kind of check. Writing up by_pred for every variable with missing values that you want to join on is too cumbersome.

aplavin · July 8, 2024, 9:27pm

I agree, maybe by_key(f_L, f_R, isequal=isequal_with_wildcard)…

aplavin · July 8, 2024, 9:27pm

It should be, for convenience – just isn’t implemented.
What I typically do after getting a cardinality error in

innerjoin((;L, R), by_..., cardinality=(1, 1))

, is run

@p outerjoin((;L, R), by_..., groupby=:L) |> filter(length(_.R) != 1)

(+ same in the other direction, groupby=:R)
and explore the result.

junder873 · April 1, 2025, 12:20am

Is it possible to do asof joins with maximum distance? For example, I want the most recent event that occurred at most 3 months ago. This is possible after the fact, but can it occur as part of the join?

For example, with the data:

d1 = DataFrame(event=[Date(2020), Date(2021)])
d2 = DataFrame(time=[Date(2019, 12), Date(2020, 6), Date(2019, 11)])

You can run

leftjoin(
    (d1, d2),
    by_pred(:event, >, :time),
    multi=(identity, closest,)
)
2×2 DataFrame
 Row │ event       time       
     │ Date        Date       
─────┼────────────────────────
   1 │ 2020-01-01  2019-12-01
   2 │ 2021-01-01  2020-06-01 # this is past the 3 month mark

Or you can make sure it is within the 3 month mark but then it is not for sure only 1 observation:

leftjoin(
    (d1, d2),
    by_pred(x -> x.event - Month(3) .. x.event, ∋, :time),
)
3×2 DataFrame
 Row │ event       time       
     │ Date        Date?      
─────┼────────────────────────
   1 │ 2020-01-01  2019-11-01
   2 │ 2020-01-01  2019-12-01
   3 │ 2021-01-01  missing

aplavin · April 1, 2025, 8:28pm

multi=closest should be supported for > < >= <= as you noticed, and for by_distance – e.g. by_distance(:event, :time, Euclidean(), <=(3)).
But I don’t think there’s anything out of the box for one-sided closest with a restriction on the max distance… It’s not more difficult, just not clear how the user-facing interface should look. For by_pred(x -> x.event - Month(3) .. x.event, ∋, :time), it’s not clear what “closest” would mean.

If you have suggestions for a natural interface, please share
But also just filtering after join should work and be basically as efficient.

junder873 · April 1, 2025, 9:17pm

My thought on an interface would be something like:

leftjoin(
    (d1, d2),
    by_pred(:event, >, :time),
    multi=(identity, closest(:event, :time))
)

I actually tried that when I was first looking at the package since it matches the by_key(...) interface. As an additional piece, it would be interesting if it accepted custom functions (closest(:event, fun, :time)). I think this could also allow items that are not already in the conditions to matter for picking the closest (e.g., find the closest time to an event without a restriction on how far away that time is).

An alternative would be moving the closest out of a keyword argument from the parent function and into the by_pred or by_distance:

leftjoin(
    (d1, d2),
    by_pred(:event, >, :time; closest=true)
    # or closest(by_pred(:event, >, :time))
)

To me, this clarifies what closest is being run on for the user.

A final alternative would be to combine those two:

leftjoin(
    (d1, d2),
    by_pred(:event, >, :time) & closest(:event, :time)
)

I am not sure about the feasibility of those options, but all of those to me fit in what is already done within the package.

aplavin · April 2, 2025, 2:54am

Thanks for the suggestions! I agree that this functionality should be added as it’s clearly useful.

The main issue with putting closest into the condition itself, no matter how exactly it’s done, is that it doesn’t specify which side the “closest” correspond to. Currently,

innerjoin((;A, B), ..., multi=(;B=closest))

means “for each a in A, take the single closest b from B”. The A/B distinction is important as “for each b in B, take the single closest a from A” would give a different result.

junder873 · April 2, 2025, 7:44pm

I guess I did not understand that from the docs.

So as a keyword argument, I think you could still do:

leftjoin(
    (; A=d1, B=d2),
    by_pred(:event, >, :time),
    multi=(; B=closest(:event, :time))
)

Or by_pred and by_distance could accept keyword arguments and do something like:

leftjoin(
    (; A=d1, B=d2),
    by_pred(:event, >, :time; mutli=(;B=closest))
)

This would be the most consistent with the current code.

Another thought would be to borrow the left/right terminology, (e.g., leftclosest and rightclosest or left_closest), so something like:

leftjoin(
    (d1, d2),
    by_pred(:event, >, :time) & leftclosest(:event, :time)
)

aplavin · April 3, 2025, 1:31am

Regarding keyword vs positional arg – the idea is that one can use names anywhere, if desired. Both named and unnamed should be supported everywhere if datasets are passed as named args (leftjoin((; A, B), ...)):

by_key(:key_in_A, :key_in_B) or by_key(A=:key_in_A, B=:key_in_B)
multi=(identity, closest) or multi=(A=identity, B=closest) or multi=(;B=closest)

I think I like the clarity of putting multi into the condition itself, as in by_pred(..., multi=(;B=closest)) or by_key(..., multi=(;B=first)) etc. I’ll try seeing how well it fits with other pieces…
In particular, probably there should be a check that only one multi argument is provided in case of several conditions ... & ... & ....

Topic		Replies	Views
FlexiJoins vs SortMerge (particularly in astronomy workflows) Astro/Space	7	109	May 5, 2025
Left join algorithm for columnar tables General Usage tables , flexijoins	4	118	November 18, 2024
Asof join support in DataFrames.jl Data dataframes	9	918	October 13, 2022
[ANN] FlexiGroups.jl -- composable and general dataset group-bys Package Announcements	1	552	November 2, 2022
Perform spatial join with GeoDataFrames New to Julia dataframes , geo	3	877	January 20, 2023

[ANN] FlexiJoins.jl: fresh take on joining datasets

Related topics