Leftjoin with replacement

I thought of an option that seems a natural extension of leftjoin. Suppose that you have two dataframes, where one is a strict subset of the other.

df1 = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"], country=["US", "US", "US"])
3×3 DataFrame
 Row │ ID     Name       country 
     │ Int64  String     String  
─────┼───────────────────────────
   1 │     1  John Doe   US
   2 │     2  Jane Doe   US
   3 │     3  Joe Blogs  US

df2 = DataFrame(ID=[2], Name=["Jane Doe"], country=["CA"])
1×3 DataFrame
 Row │ ID     Name      country 
     │ Int64  String    String  
─────┼──────────────────────────
   1 │     2  Jane Doe  CA

Imagine you want to update the value of df1.country based on df2. It seems natural to have:
leftjoin!(df1, df2, on =[:ID,:Name]; replacement=true). (or some other name, like subset=true), whose result would be

 Row │ ID     Name       country 
     │ Int64  String     String  
─────┼───────────────────────────
   1 │     1  John Doe   US
   2 │     2  Jane Doe   CA
   3 │     3  Joe Blogs  US

I was thinking that this could be useful if you had df1, but after several months there are updates to information or corrections, and they are presented as in df2.

Best

See:

https://github.com/JuliaData/DataFrames.jl/issues/2243

1 Like

I hadn’t seen that post, since I wasn’t thinking of missing values in particular. But, yeah, it’s exactly the same point. Nice to know that people are already thinking about it!

use update in InMemoryDatasets.

1 Like