Hi, I am trying to find an elegant solution to update a DataFrame (B
), using the non-missing elements in another DataFrame (A
) with the same columns.
For instance, with
using Dates, DataFrames;
# Dates
date_1 = Date("1990-01-31", "yyyy-mm-dd");
date_2 = Date("1990-12-31", "yyyy-mm-dd");
# DataFrames
A = DataFrame(:date=>date_2, :x=>missing, :y=>100);
B = DataFrame(:date=>collect(date_1:Dates.Month(1):date_2), :x=>rand(12), :y=>rand(12));
I would like to replace the last element in B
referring to y
with 100, while leaving x
as it is in B
.
In my actual scenario, which might be too complicated to post here, A
can have:
- multiple rows
- fully observed
x
and y
- random patterns of missing observations in
x
and y
and always have dates included in B
.
Thatβs what Iβd like to do, but I am not sure how to handle the missings with Julia in this case. Can you please elaborate on that?
Are you coming from Stata, by chance? This is common in Stata.
I think the best option is to do
julia> leftjoin(B, A; on = :date, makeunique=true)
12Γ5 DataFrame
Row β date x y x_1 y_1
β Date Float64 Float64 Missing Int64?
ββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββ
1 β 1990-12-31 0.697569 0.0141314 missing 100
2 β 1990-01-31 0.890485 0.735816 missing missing
3 β 1990-02-28 0.563947 0.26717 missing missing
4 β 1990-03-31 0.692067 0.193784 missing missing
5 β 1990-04-30 0.494578 0.904987 missing missing
6 β 1990-05-31 0.709388 0.59444 missing missing
7 β 1990-06-30 0.91898 0.26661 missing missing
8 β 1990-07-31 0.8294 0.638376 missing missing
9 β 1990-08-31 0.939963 0.48406 missing missing
10 β 1990-09-30 0.116675 0.498974 missing missing
11 β 1990-10-31 0.736912 0.5008 missing missing
12 β 1990-11-30 0.373643 0.549012 missing missing
and then handle the updating after. via ifelse
or similar.
I brought up an update
keyword argument on github a while ago, but canβt find the post. Itβs something Iβd like to see eventually for sure.
1 Like
No, I am not coming from Stata. I can do that, but it looks a little more computationally expensive than looping over the rows of A
and updating B
inplace - while much easier to implement. It would be very handy to have some built-in function to do that.
Is this update
keyword argument in production now?
Yeah I guess itβs worth benchmarking.
No, update
isnβt being worked on right now. You could bring it up in an issue and see if it can be added.
in this case you can use this
joined.y=coalesce.(joined.y_1,joined.y)
other ways to lookup the desidered values:
transform(B,[:date,:y]=>ByRow((x,y)-> findfirst(==(x),A.date)===nothing ? y : A.y[findfirst(==(x),A.date)])=>:y)
transform(B,[:date,:y]=>ByRow((x,y)-> try A.y[findfirst(==(x),A.date)] catch _ y end)=>:y)
but I donβt think they are more efficient than the join function
Can you just open a new issue then please? I do issue review from time to time and deduplicate such cases, but it is better to open it. As usual - I would appreciate a precise specification of what exactly is requested as I was not sure. In particular the findfirst
approach above seems not fully correct to me as there might be duplicate entries with the same value in A.date
. (and I would not concentrate on performance now - rather the API design; later performant implementation would be proposed; as for this case join is O(n+m) and findfirst
is O(nm) so for large tables join will be much faster even now)
this might handle repeated dates occurrence, although not the performance aspects
flatten(transform(B,[:date,:y]=>ByRow((x,y)-> findall(==(x),A.date)==[] ? y : A.y[findall(==(x),A.date)])=>:y),:y)