Take data from one DataFrame to another based on a conditional

The problem I’m having is I can’t figure out how to move data from one DataFrame to a different one. Let’s say I have two DataFrames.
df1 is:
Team Rating
Chi 10
Ny 8
Oak 6
Ott 6
df2 is:
HomeTeam HomeRat AwayTeam AwayRat
Chi missing Oak missing
Ny missing Ott missing

I want to be able to check if the home team or away the team is in df2 is the same as in Team in df1 and if it is move that value into the correct column. I’ve been trying over and over to try and figure this out with for loops and broadcasting but I just can’t get it.

What you are asking for is often referred to as an update functionality. I think this isn’t yet available in DataFrames, but there’s an issue for it somewhere (on my phone so a bit cumbersome to check).

The way to do it currently is:

leftjoin(df2, df1, on = :HomeTeam => :Team) 

And then if you want to replace the missing values in the HomeRat column do

df2.HomeRat = coalesce.(df2.HomeRat, df2.Rating)

You can then do the same joining on the :AwayTeam column in the first step.

An alternative approach is to create a dictionary from your first DataFrame and then simply replace the columns in the second DataFrame like this:

df1dict = Dict(df1.Team .=> df1.Rating)
df2.HomeRat = [df1dict[x] for x in df2.HomeTeam]
df2.AwayRat = [df1dict[x] for x in df2.AwayTeam]
2 Likes

We do not have update! yet but one can do leftjoin! which is in-place:

julia> df1 = DataFrame(Team=["Chi", "Ny", "Oak", "Ott"], Rating=[10, 8, 6, 4])
4×2 DataFrame
 Row │ Team    Rating
     │ String  Int64
─────┼────────────────
   1 │ Chi         10
   2 │ Ny           8
   3 │ Oak          6
   4 │ Ott          4

julia> df2 = DataFrame(HomeTeam=["Chi", "Ny"], AwayTeam=["Oak", "Ott"])
2×2 DataFrame
 Row │ HomeTeam  AwayTeam
     │ String    String
─────┼────────────────────
   1 │ Chi       Oak
   2 │ Ny        Ott

julia> leftjoin!(df2, rename(df1, "Rating" => "HomeRat"), on=:HomeTeam=>:Team)
2×3 DataFrame
 Row │ HomeTeam  AwayTeam  HomeRat
     │ String    String    Int64?
─────┼─────────────────────────────
   1 │ Chi       Oak            10
   2 │ Ny        Ott             8

julia> leftjoin!(df2, rename(df1, "Rating" => "AwayRat"), on=:AwayTeam=>:Team)
2×4 DataFrame
 Row │ HomeTeam  AwayTeam  HomeRat  AwayRat
     │ String    String    Int64?   Int64?
─────┼──────────────────────────────────────
   1 │ Chi       Oak            10        6
   2 │ Ny        Ott             8        4
2 Likes

I have found this leftjoin_or_fill! function useful, although I can’t quite tell if it’s what you want here.

"""
    leftjoin_or_fill!(df1, df2; on)

Adds columns from `df2` to `df1` using `leftjoin!`, except if there are duplicate columns
between `df1` and `df2`, then [`choose_nonmissing`](@ref) is used to merge them.
"""
function leftjoin_or_fill!(df1, df2; on)
    if !(on isa AbstractVector)
        on = [on]
    end
    shared_cols = setdiff(intersect(names(df1), names(df2)), string.(on))
    leftjoin!(df1, df2; on, makeunique=true)
    for col in shared_cols
        df1[!, col] = choose_nonmissing.(df1[!, col], df1[!, string(col, "_1")]; col)
    end
    select!(df1, Not([string(col, "_1") for col in shared_cols]))
    return df1
end

where

"""
    choose_nonmissing(x, y; col="")

* If exactly one of `x` or `y` is non-`missing`, return it.
* If both are missing, return `missing`
* If both are non-missing, and `x != y`, throw an error. Otherwise return `x`.
"""
function choose_nonmissing(x, y; col="")
    if ismissing(x)
        return y
    elseif ismissing(y)
        return x
    else
        if x != y
            error("Given non-missing values $x and $y which should be equal for column $col.")
        end
        return x
    end
end
julia> df1 = DataFrame(Team=["Chi", "Ny", "Oak", "Ott"], Rating=[10, 8, 6, 4])
4×2 DataFrame
 Row │ Team    Rating
     │ String  Int64
─────┼────────────────
   1 │ Chi         10
   2 │ Ny           8
   3 │ Oak          6
   4 │ Ott          4

julia> df2 = DataFrame(HomeTeam=["Chi", "Ny"],AwayTeam=["Oak", "Ott"])
2×2 DataFrame
 Row │ HomeTeam  AwayTeam
     │ String    String
─────┼────────────────────
   1 │ Chi       Oak
   2 │ Ny        Ott

julia>

julia>

julia> unstack(leftjoin!(df1, stack(df2,[:HomeTeam,  :AwayTeam]), on=:Team=>:value),:variable, :Rating)
4×3 DataFrame
 Row │ Team    HomeTeam  AwayTeam     │ String  Int64?    Int64?
─────┼────────────────────────────
   1 │ Chi           10   missing
   2 │ Ny             8   missing
   3 │ Oak      missing         6
   4 │ Ott      missing         4

Thank you all so much for the for the help. I had to be away from my computer for a couple days but I was able to do what I was trying to do pretty quickly because of you all. If you’re interested this is what I ended up doing.

Allteams = CSV.read(“AllTeamPre.csv”, DataFrame, stringtype=String)
Sch = CSV.read(“CollegeFootballSch22.csv”, DataFrame, stringtype=String)
Allteams.Adj = coalesce.(Allteams.Adj, 0)
Allteams.Rating = (Allteams.Talent .+ Allteams.Adj)
dropmissing!(Allteams)
powerconf = (“ACC”, “AAC”, “Big Ten”, “Big 12”, “Mountain West”, “Pac-12” , “SEC”)
subset!(Sch, :home_conference => ByRow(∈(powerconf)); skipmissing=true)
subset!(Sch, :away_conference => ByRow(∈(powerconf)); skipmissing=true)
RatingDict = Dict(Allteams.team .=> Allteams.Rating)
Sch.home_rat = [RatingDict for x in Sch.home_team]
Sch.away_rat = [RatingDict for x in Sch.away_team]
PowerConfSch = select(Sch,
:week,
:neutral_site,
:home_team,
:home_rat,
:away_team,
:away_rat
)