Getting the min value of two dataframes with identical cols

I have two almost identical df

julia> df1 = DataFrame(ID = [1, 2, 3, 4,6],
                       Value = [10, 20, 30, 40,10])
5×2 DataFrame
 Row │ ID     Value 
     │ Int64  Int64 
─────┼──────────────
   1 │     1     10
   2 │     2     20
   3 │     3     30
   4 │     4     40
   5 │     6     10

julia> df2 = DataFrame(ID = [1, 2, 3, 4,5],
                       Value = [15, 25, 5, 45,30])
5×2 DataFrame
 Row │ ID     Value 
     │ Int64  Int64 
─────┼──────────────
   1 │     1     15
   2 │     2     25
   3 │     3      5
   4 │     4     45
   5 │     5     30

I want to obtain a new dataframe with the min value for all existing rows from column “value”.

julia> df 
6×2 DataFrame
 Row │ ID     Value 
     │ Int64  Int64 
─────┼──────────────
   1 │     1     10
   2 │     2     20
   3 │     3      5
   4 │     4     40
   5 │     5     30
   6 │     5     10

Any easy way to do this?
Thanks.

Here’s one way. First we join both data frames using an outer join (this keeps rows where an ID is missing from one of them):

julia> df = outerjoin(df1, df2, on=:ID, makeunique=true)
6×3 DataFrame
 Row │ ID     Value    Value_1 
     │ Int64  Int64?   Int64?  
─────┼─────────────────────────
   1 │     1       10       15
   2 │     2       20       25
   3 │     3       30        5
   4 │     4       40       45
   5 │     6       10  missing 
   6 │     5  missing       30

Then we transform the data frame by passing each row to a function that ignores missing values and keeps the minimum:

julia> transform(df, AsTable([:Value, :Value_1]) =>
                     ByRow(minimum∘skipmissing))
6×4 DataFrame
 Row │ ID     Value    Value_1  Value_Value_1_minimum_skipmissing 
     │ Int64  Int64?   Int64?   Int64                             
─────┼────────────────────────────────────────────────────────────
   1 │     1       10       15                                 10
   2 │     2       20       25                                 20
   3 │     3       30        5                                  5
   4 │     4       40       45                                 40
   5 │     6       10  missing                                 10
   6 │     5  missing       30                                 30

This uses AsTable so that the two values are passed to the processing function as a named tuple rather than two separate arguments.

You could also do the second step without using the DataFrame minilanguage:

df.min = minimum.(skipmissing.(eachrow(df[:, r"Value"])))

# Result:
6×4 DataFrame
 Row │ ID     Value    Value_1  min   
     │ Int64  Int64?   Int64?   Int64 
─────┼────────────────────────────────
   1 │     1       10       15     10
   2 │     2       20       25     20
   3 │     3       30        5      5
   4 │     4       40       45     40
   5 │     6       10  missing     10
   6 │     5  missing       30     30
1 Like

I don’t know if it’s more efficient, but it’s another way of achieving the same result

df=vcat(df1,df2)
gdf=groupby(df,:ID)

combine(gdf,:Value=>x->min(x...))

# or

combine(gdf,:Value=>Base.splat(min))

2 Likes

Here is a version with DataFramesMeta.jl

julia> df = @chain df1 begin
           leftjoin(df2; on = :ID, renamecols = "_1" => "_2")
           @rtransform :min_Value = min(:Value_1, :Value_2)
       end
5×4 DataFrame
 Row │ ID     Value_1  Value_2  min_Value 
     │ Int64  Int64    Int64?   Int64?    
─────┼────────────────────────────────────
   1 │     1       10       15         10
   2 │     2       20       25         20
   3 │     3       30        5          5
   4 │     4       40       45         40
   5 │     6       10  missing    missing 

julia> # To handle missings
       df = @chain df1 begin
           leftjoin(df2; on = :ID, renamecols = "_1" => "_2")
           @rtransform :min_Value = minimum(skipmissing(AsTable([:Value_1, :Value_2])))
       end
5×4 DataFrame
 Row │ ID     Value_1  Value_2  min_Value 
     │ Int64  Int64    Int64?   Int64     
─────┼────────────────────────────────────
   1 │     1       10       15         10
   2 │     2       20       25         20
   3 │     3       30        5          5
   4 │     4       40       45         40
   5 │     6       10  missing         10

Are you coming from Stata, by chance?

I don’t know why, this form, which was the first one I tried, didn’t work for me.
I tried it again now with the intention of asking on the forum or opening an issue, if necessary.

In fact, as well as being of more general validity (also valid for many “almost equal” DFs), it also seems more Julianic to me

combine(gdf,:Value=>minimum)

Another method, bypassing some of the standard DataFrame wrangling machinary:

let
    dd = mergewith(min,(Dict(d.ID .=> d.Value) for d in (df1,df2))...)
    sort!(DataFrame(ID=collect(keys(dd)), Value=collect(values(dd))), :ID)
end

Giving again:

6×2 DataFrame
 Row │ ID     Value 
     │ Int64  Int64 
─────┼──────────────
   1 │     1     10
   2 │     2     20
   3 │     3      5
   4 │     4     40
   5 │     5     30
   6 │     6     10