Use the non-missing elements of DataFrame A to update DataFrame B

Hi, I am trying to find an elegant solution to update a DataFrame (B), using the non-missing elements in another DataFrame (A) with the same columns.

For instance, with

using Dates, DataFrames;

# Dates
date_1 = Date("1990-01-31", "yyyy-mm-dd");
date_2 = Date("1990-12-31", "yyyy-mm-dd");

# DataFrames
A = DataFrame(:date=>date_2, :x=>missing, :y=>100);
B = DataFrame(:date=>collect(date_1:Dates.Month(1):date_2), :x=>rand(12), :y=>rand(12));

I would like to replace the last element in B referring to y with 100, while leaving x as it is in B.

In my actual scenario, which might be too complicated to post here, A can have:

  1. multiple rows
  2. fully observed x and y
  3. random patterns of missing observations in x and y

and always have dates included in B.

Why not a join?

1 Like

That’s what I’d like to do, but I am not sure how to handle the missings with Julia in this case. Can you please elaborate on that?

Are you coming from Stata, by chance? This is common in Stata.

I think the best option is to do

julia> leftjoin(B, A; on = :date, makeunique=true)
12Γ—5 DataFrame
 Row β”‚ date        x         y          x_1      y_1
     β”‚ Date        Float64   Float64    Missing  Int64?
─────┼───────────────────────────────────────────────────
   1 β”‚ 1990-12-31  0.697569  0.0141314  missing      100
   2 β”‚ 1990-01-31  0.890485  0.735816   missing  missing
   3 β”‚ 1990-02-28  0.563947  0.26717    missing  missing
   4 β”‚ 1990-03-31  0.692067  0.193784   missing  missing
   5 β”‚ 1990-04-30  0.494578  0.904987   missing  missing
   6 β”‚ 1990-05-31  0.709388  0.59444    missing  missing
   7 β”‚ 1990-06-30  0.91898   0.26661    missing  missing
   8 β”‚ 1990-07-31  0.8294    0.638376   missing  missing
   9 β”‚ 1990-08-31  0.939963  0.48406    missing  missing
  10 β”‚ 1990-09-30  0.116675  0.498974   missing  missing
  11 β”‚ 1990-10-31  0.736912  0.5008     missing  missing
  12 β”‚ 1990-11-30  0.373643  0.549012   missing  missing

and then handle the updating after. via ifelse or similar.

I brought up an update keyword argument on github a while ago, but can’t find the post. It’s something I’d like to see eventually for sure.

1 Like

No, I am not coming from Stata. I can do that, but it looks a little more computationally expensive than looping over the rows of A and updating B inplace - while much easier to implement. It would be very handy to have some built-in function to do that.

Is this update keyword argument in production now?

Yeah I guess it’s worth benchmarking.

No, update isn’t being worked on right now. You could bring it up in an issue and see if it can be added.

in this case you can use this

joined.y=coalesce.(joined.y_1,joined.y)

other ways to lookup the desidered values:

transform(B,[:date,:y]=>ByRow((x,y)-> findfirst(==(x),A.date)===nothing ? y : A.y[findfirst(==(x),A.date)])=>:y)

transform(B,[:date,:y]=>ByRow((x,y)-> try A.y[findfirst(==(x),A.date)] catch _ y end)=>:y)

but I don’t think they are more efficient than the join function

Can you just open a new issue then please? I do issue review from time to time and deduplicate such cases, but it is better to open it. As usual - I would appreciate a precise specification of what exactly is requested as I was not sure. In particular the findfirst approach above seems not fully correct to me as there might be duplicate entries with the same value in A.date. (and I would not concentrate on performance now - rather the API design; later performant implementation would be proposed; as for this case join is O(n+m) and findfirst is O(nm) so for large tables join will be much faster even now)

this might handle repeated dates occurrence, although not the performance aspects

flatten(transform(B,[:date,:y]=>ByRow((x,y)-> findall(==(x),A.date)==[] ? y : A.y[findall(==(x),A.date)])=>:y),:y)