Use the non-missing elements of DataFrame A to update DataFrame B

fipelle · July 14, 2021, 8:19pm

Hi, I am trying to find an elegant solution to update a DataFrame (B), using the non-missing elements in another DataFrame (A) with the same columns.

For instance, with

using Dates, DataFrames;

# Dates
date_1 = Date("1990-01-31", "yyyy-mm-dd");
date_2 = Date("1990-12-31", "yyyy-mm-dd");

# DataFrames
A = DataFrame(:date=>date_2, :x=>missing, :y=>100);
B = DataFrame(:date=>collect(date_1:Dates.Month(1):date_2), :x=>rand(12), :y=>rand(12));

I would like to replace the last element in B referring to y with 100, while leaving x as it is in B.

In my actual scenario, which might be too complicated to post here, A can have:

multiple rows
fully observed x and y
random patterns of missing observations in x and y

and always have dates included in B.

pdeffebach · July 14, 2021, 8:45pm

Why not a join?

fipelle · July 14, 2021, 10:49pm

That’s what I’d like to do, but I am not sure how to handle the missings with Julia in this case. Can you please elaborate on that?

pdeffebach · July 14, 2021, 10:55pm

Are you coming from Stata, by chance? This is common in Stata.

I think the best option is to do

julia> leftjoin(B, A; on = :date, makeunique=true)
12×5 DataFrame
 Row │ date        x         y          x_1      y_1
     │ Date        Float64   Float64    Missing  Int64?
─────┼───────────────────────────────────────────────────
   1 │ 1990-12-31  0.697569  0.0141314  missing      100
   2 │ 1990-01-31  0.890485  0.735816   missing  missing
   3 │ 1990-02-28  0.563947  0.26717    missing  missing
   4 │ 1990-03-31  0.692067  0.193784   missing  missing
   5 │ 1990-04-30  0.494578  0.904987   missing  missing
   6 │ 1990-05-31  0.709388  0.59444    missing  missing
   7 │ 1990-06-30  0.91898   0.26661    missing  missing
   8 │ 1990-07-31  0.8294    0.638376   missing  missing
   9 │ 1990-08-31  0.939963  0.48406    missing  missing
  10 │ 1990-09-30  0.116675  0.498974   missing  missing
  11 │ 1990-10-31  0.736912  0.5008     missing  missing
  12 │ 1990-11-30  0.373643  0.549012   missing  missing

and then handle the updating after. via ifelse or similar.

I brought up an update keyword argument on github a while ago, but can’t find the post. It’s something I’d like to see eventually for sure.

fipelle · July 14, 2021, 11:26pm

No, I am not coming from Stata. I can do that, but it looks a little more computationally expensive than looping over the rows of A and updating B inplace - while much easier to implement. It would be very handy to have some built-in function to do that.

Is this update keyword argument in production now?

pdeffebach · July 15, 2021, 3:25am

Yeah I guess it’s worth benchmarking.

No, update isn’t being worked on right now. You could bring it up in an issue and see if it can be added.

rocco_sprmnt21 · July 18, 2021, 8:04am

in this case you can use this

joined.y=coalesce.(joined.y_1,joined.y)

rocco_sprmnt21 · July 18, 2021, 8:07am

other ways to lookup the desidered values:

transform(B,[:date,:y]=>ByRow((x,y)-> findfirst(==(x),A.date)===nothing ? y : A.y[findfirst(==(x),A.date)])=>:y)

transform(B,[:date,:y]=>ByRow((x,y)-> try A.y[findfirst(==(x),A.date)] catch _ y end)=>:y)

but I don’t think they are more efficient than the join function

bkamins · July 18, 2021, 8:20am

Can you just open a new issue then please? I do issue review from time to time and deduplicate such cases, but it is better to open it. As usual - I would appreciate a precise specification of what exactly is requested as I was not sure. In particular the findfirst approach above seems not fully correct to me as there might be duplicate entries with the same value in A.date. (and I would not concentrate on performance now - rather the API design; later performant implementation would be proposed; as for this case join is O(n+m) and findfirst is O(nm) so for large tables join will be much faster even now)

rocco_sprmnt21 · July 18, 2021, 8:53am

this might handle repeated dates occurrence, although not the performance aspects

flatten(transform(B,[:date,:y]=>ByRow((x,y)-> findall(==(x),A.date)==[] ? y : A.y[findall(==(x),A.date)])=>:y),:y)

Topic		Replies	Views
Fill gaps in DataFrame Data dataframes	10	1254	May 13, 2021
Assembling data using DataFrame such that rows where a particular array has empty elements are discounted New to Julia	5	329	July 12, 2020
Adding information from one DataFrame to another New to Julia	11	491	July 29, 2021
How to compare non-missing elements of two DataFrames New to Julia	3	557	July 1, 2020
Identify and add missing rows dynamically in dataframe General Usage dataframes , transformers	6	833	April 9, 2024

Use the non-missing elements of DataFrame A to update DataFrame B

Related topics