I have two almost identical df
julia> df1 = DataFrame(ID = [1, 2, 3, 4,6],
Value = [10, 20, 30, 40,10])
5×2 DataFrame
Row │ ID Value
│ Int64 Int64
─────┼──────────────
1 │ 1 10
2 │ 2 20
3 │ 3 30
4 │ 4 40
5 │ 6 10
julia> df2 = DataFrame(ID = [1, 2, 3, 4,5],
Value = [15, 25, 5, 45,30])
5×2 DataFrame
Row │ ID Value
│ Int64 Int64
─────┼──────────────
1 │ 1 15
2 │ 2 25
3 │ 3 5
4 │ 4 45
5 │ 5 30
I want to obtain a new dataframe with the min value for all existing rows from column “value”.
julia> df
6×2 DataFrame
Row │ ID Value
│ Int64 Int64
─────┼──────────────
1 │ 1 10
2 │ 2 20
3 │ 3 5
4 │ 4 40
5 │ 5 30
6 │ 5 10
Any easy way to do this?
Thanks.
sijo
February 26, 2024, 2:21pm
2
Here’s one way. First we join both data frames using an outer join (this keeps rows where an ID is missing from one of them):
julia> df = outerjoin(df1, df2, on=:ID, makeunique=true)
6×3 DataFrame
Row │ ID Value Value_1
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 10 15
2 │ 2 20 25
3 │ 3 30 5
4 │ 4 40 45
5 │ 6 10 missing
6 │ 5 missing 30
Then we transform the data frame by passing each row to a function that ignores missing values and keeps the minimum:
julia> transform(df, AsTable([:Value, :Value_1]) =>
ByRow(minimum∘skipmissing))
6×4 DataFrame
Row │ ID Value Value_1 Value_Value_1_minimum_skipmissing
│ Int64 Int64? Int64? Int64
─────┼────────────────────────────────────────────────────────────
1 │ 1 10 15 10
2 │ 2 20 25 20
3 │ 3 30 5 5
4 │ 4 40 45 40
5 │ 6 10 missing 10
6 │ 5 missing 30 30
This uses AsTable
so that the two values are passed to the processing function as a named tuple rather than two separate arguments.
You could also do the second step without using the DataFrame minilanguage:
df.min = minimum.(skipmissing.(eachrow(df[:, r"Value"])))
# Result:
6×4 DataFrame
Row │ ID Value Value_1 min
│ Int64 Int64? Int64? Int64
─────┼────────────────────────────────
1 │ 1 10 15 10
2 │ 2 20 25 20
3 │ 3 30 5 5
4 │ 4 40 45 40
5 │ 6 10 missing 10
6 │ 5 missing 30 30
1 Like
I don’t know if it’s more efficient, but it’s another way of achieving the same result
df=vcat(df1,df2)
gdf=groupby(df,:ID)
combine(gdf,:Value=>x->min(x...))
# or
combine(gdf,:Value=>Base.splat(min))
2 Likes
Here is a version with DataFramesMeta.jl
julia> df = @chain df1 begin
leftjoin(df2; on = :ID, renamecols = "_1" => "_2")
@rtransform :min_Value = min(:Value_1, :Value_2)
end
5×4 DataFrame
Row │ ID Value_1 Value_2 min_Value
│ Int64 Int64 Int64? Int64?
─────┼────────────────────────────────────
1 │ 1 10 15 10
2 │ 2 20 25 20
3 │ 3 30 5 5
4 │ 4 40 45 40
5 │ 6 10 missing missing
julia> # To handle missings
df = @chain df1 begin
leftjoin(df2; on = :ID, renamecols = "_1" => "_2")
@rtransform :min_Value = minimum(skipmissing(AsTable([:Value_1, :Value_2])))
end
5×4 DataFrame
Row │ ID Value_1 Value_2 min_Value
│ Int64 Int64 Int64? Int64
─────┼────────────────────────────────────
1 │ 1 10 15 10
2 │ 2 20 25 20
3 │ 3 30 5 5
4 │ 4 40 45 40
5 │ 6 10 missing 10
Are you coming from Stata, by chance?
I don’t know why, this form, which was the first one I tried, didn’t work for me.
I tried it again now with the intention of asking on the forum or opening an issue, if necessary.
In fact, as well as being of more general validity (also valid for many “almost equal” DFs), it also seems more Julianic to me
combine(gdf,:Value=>minimum)
Dan
February 27, 2024, 5:14pm
6
Another method, bypassing some of the standard DataFrame wrangling machinary:
let
dd = mergewith(min,(Dict(d.ID .=> d.Value) for d in (df1,df2))...)
sort!(DataFrame(ID=collect(keys(dd)), Value=collect(values(dd))), :ID)
end
Giving again:
6×2 DataFrame
Row │ ID Value
│ Int64 Int64
─────┼──────────────
1 │ 1 10
2 │ 2 20
3 │ 3 5
4 │ 4 40
5 │ 5 30
6 │ 6 10