I have two dataframes as shown below and need to replace one column(df.M02) with column(df2.M02) for year 2024 only. I know this can be achieved in multiple ways but I am just trying to find better way to do this.
2×13 DataFrame
Row │ name type year sub_type M01 M02 M03 M04 M05 M06 M07 M08 M09
│ String String Int64 String Float64? Float64 Float64 Float64 Float64 Float64 Float64? Float64? Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ test-24 nulltest 2024 prev_mth missing 3.4 5.6 7.0 8.0 9.0 missing missing 8.5
2 │ test-24 nulltest 2025 prev_mth missing 3.4 5.6 7.0 8.0 9.0 missing missing 8.5
2×13 DataFrame
Row │ name type year sub_type M01 M02 M03 M04 M05 M06 M07 M08 M09
│ String String Int64 String Float64? Float64 Float64 Float64 Float64 Float64 Float64? Float64? Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ test-23 nulltest 2024 prev_mth missing 3.4 5.6 7.0 8.0 9.0 missing missing 8.5
2 │ test-24 nulltest 2025 prev_mth missing 3.4 5.6 7.0 8.0 9.0 missing missing 8.5
jling
April 19, 2024, 4:03pm
2
I would say:
mask1 = df.year .== 2024
mask2 = df2.year .== 2024
@assert mask1 == mask2
@views @. df.M02[mask1] = df2.M02[mask2]
If you know that the rows are identical and in order:
df.M02 = ifelse.(df.year .== 2024, df2.M02, df.M02)
using CSV, DataFrames
df1="""
name type year sub_type M01 M02 M03 M04 M05 M06 M07 M08 M09
test-24 nulltest 2024 prev_mth missing 3.41 5.6 7.0 8.0 9.0 missing missing 8.5
test-24 nulltest 2025 prev_mth missing 3.41 5.6 7.0 8.0 9.0 missing missing 8.5
"""
df1=CSV.read(IOBuffer(df1), DataFrame, delim=' ', ignorerepeated=true)
df2="""
name type year sub_type M01 M02 M03 M04 M05 M06 M07 M08 M09
test-23 nulltest 2024 prev_mth missing 3.42 5.6 7.0 8.0 9.0 missing missing 8.5
test-24 nulltest 2025 prev_mth missing 3.42 5.6 7.0 8.0 9.0 missing missing 8.5
"""
df2=CSV.read(IOBuffer(df2), DataFrame, delim=' ', ignorerepeated=true)
grp1=groupby(df1,:year)
grp2=groupby(df2,:year)
# (df.M02) with column(df2.M02)
grp1[(2024,)].M02=grp2[(2024,)].M02
julia> df1
2×13 DataFrame
Row │ name type year sub_type M01 M02 M03 M04 M05 M06 M07 M08 M0 ⋯
│ String7 String15 Int64 String15 String7 Float64 Float64 Float64 Float64 Float64 String7 String7 Fl ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ test-24 nulltest 2024 prev_mth missing 3.42 5.6 7.0 8.0 9.0 missing missing ⋯
2 │ test-24 nulltest 2025 prev_mth missing 3.41 5.6 7.0 8.0 9.0 missing missing
1 column omitted
A leftjoin
is probably the best choice here. Unless you absolutely know that the rows in each data frame overlap, you should avoid doing stuff like df1.x = df2.x
.
Here’s a solution involving a join
and DataFramesMeta.jl
@chain df begin
leftjoin(df1, @select(df2, :id, :M02_2 = :M02), on = :id, )
@rtransform :M02 = :year == 2004 ? :M02_2 : M02
end
1 Like
despite all attempts to give an answer, the question seems to lie precisely on this aspect: the problem is not well defined.