Hi, I am trying to find an elegant solution to update a DataFrame (
B), using the non-missing elements in another DataFrame (
A) with the same columns.
For instance, with
using Dates, DataFrames;
date_1 = Date("1990-01-31", "yyyy-mm-dd");
date_2 = Date("1990-12-31", "yyyy-mm-dd");
A = DataFrame(:date=>date_2, :x=>missing, :y=>100);
B = DataFrame(:date=>collect(date_1:Dates.Month(1):date_2), :x=>rand(12), :y=>rand(12));
I would like to replace the last element in
B referring to
y with 100, while leaving
x as it is in
In my actual scenario, which might be too complicated to post here,
A can have:
- multiple rows
- fully observed
- random patterns of missing observations in
and always have dates included in
That’s what I’d like to do, but I am not sure how to handle the missings with Julia in this case. Can you please elaborate on that?
Are you coming from Stata, by chance? This is common in Stata.
I think the best option is to do
julia> leftjoin(B, A; on = :date, makeunique=true)
Row │ date x y x_1 y_1
│ Date Float64 Float64 Missing Int64?
1 │ 1990-12-31 0.697569 0.0141314 missing 100
2 │ 1990-01-31 0.890485 0.735816 missing missing
3 │ 1990-02-28 0.563947 0.26717 missing missing
4 │ 1990-03-31 0.692067 0.193784 missing missing
5 │ 1990-04-30 0.494578 0.904987 missing missing
6 │ 1990-05-31 0.709388 0.59444 missing missing
7 │ 1990-06-30 0.91898 0.26661 missing missing
8 │ 1990-07-31 0.8294 0.638376 missing missing
9 │ 1990-08-31 0.939963 0.48406 missing missing
10 │ 1990-09-30 0.116675 0.498974 missing missing
11 │ 1990-10-31 0.736912 0.5008 missing missing
12 │ 1990-11-30 0.373643 0.549012 missing missing
and then handle the updating after. via
ifelse or similar.
I brought up an
update keyword argument on github a while ago, but can’t find the post. It’s something I’d like to see eventually for sure.
No, I am not coming from Stata. I can do that, but it looks a little more computationally expensive than looping over the rows of
A and updating
B inplace - while much easier to implement. It would be very handy to have some built-in function to do that.
update keyword argument in production now?
Yeah I guess it’s worth benchmarking.
update isn’t being worked on right now. You could bring it up in an issue and see if it can be added.
in this case you can use this
other ways to lookup the desidered values:
transform(B,[:date,:y]=>ByRow((x,y)-> findfirst(==(x),A.date)===nothing ? y : A.y[findfirst(==(x),A.date)])=>:y)
transform(B,[:date,:y]=>ByRow((x,y)-> try A.y[findfirst(==(x),A.date)] catch _ y end)=>:y)
but I don’t think they are more efficient than the join function
Can you just open a new issue then please? I do issue review from time to time and deduplicate such cases, but it is better to open it. As usual - I would appreciate a precise specification of what exactly is requested as I was not sure. In particular the
findfirst approach above seems not fully correct to me as there might be duplicate entries with the same value in
A.date. (and I would not concentrate on performance now - rather the API design; later performant implementation would be proposed; as for this case join is O(n+m) and
findfirst is O(nm) so for large tables join will be much faster even now)
this might handle repeated dates occurrence, although not the performance aspects
flatten(transform(B,[:date,:y]=>ByRow((x,y)-> findall(==(x),A.date)== ? y : A.y[findall(==(x),A.date)])=>:y),:y)