Discarding rows of dataframe base on condition on column of another dataframe

mocalvao · December 23, 2023, 1:52pm

Hi there,
I have two dataframes:

df1=DataFrame(Student=["Abel", "Beatriz", "Beatriz", "Caio", "Dario", "Dario"],Class=["12", "12", "10", "11", "13", "14"], Professor=["John", "John", "Jack", "Moe", "Heather", "Lisa"], P1=[7.8, 4.2, 6.9, 3.0, 1.3, 8.2])

6×4 DataFrame
 Row │ Student  Class   Professor  P1      
     │ String   String  String     Float64 
─────┼─────────────────────────────────────
   1 │ Abel     12      John           7.8
   2 │ Beatriz  12      John           4.2
   3 │ Beatriz  10      Jack           6.9
   4 │ Caio     11      Moe            3.0
   5 │ Dario    13      Heather        1.3
   6 │ Dario    14      Lisa           8.2

and

df2=DataFrame(Student=["Abel", "Beatriz", "Caio", "Beatriz", "Dario", "Dario"],Class=["12", "12", "11", "10", "13", "14"], Status=["Active
", "Active", "Active", "Inactive", "Inactive", "Active"])

6×3 DataFrame
 Row │ Student  Class   Status   
     │ String   String  String   
─────┼───────────────────────────
   1 │ Abel     12      Active
   2 │ Beatriz  12      Active
   3 │ Caio     11      Active
   4 │ Beatriz  10      Inactive
   5 │ Dario    13      Inactive
   6 │ Dario    14      Active

I would like to “clean” df1, so that its rows which correspond to df2.Status.==“Inactive” are suppressed and the final df1 dataframe turns out to be:

df1_cleaned = DataFrame(Student=["Abel", "Beatriz", "Caio", "Dario"], Class=["12", "12", "11", "14"], Professor=["John", "John", "Moe", "Lisa"], P1=["7.8", "4.2", "3.0", "8.2"])

4×4 DataFrame
 Row │ Student  Class   Professor  P1      
     │ String   String  String     Float64 
─────┼─────────────────────────────────────
   1 │ Abel     12      John           7.8
   2 │ Beatriz  12      John           4.2
   3 │ Caio     11      Moe            3.0
   4 │ Dario    14      Lisa           8.2

Things I have tried involved:

unique function
join DataFrames operations

but I really could not find my way out…
Thanks in advance.

bertschi · December 23, 2023, 2:07pm

@chain df1 begin
    leftjoin(df2, on = [:Student, :Class])
    subset(:Status => s -> .!(s .== "Inactive"))
    select(names(df1))
end

Dan · December 23, 2023, 2:30pm

Just to make things more self-contained, to use the @chain macro, you need the Chain.jl package. The macro starts with df1 and makes the result of each line the first argument of the previous line so:

# install Chain with: `using Pkg; Pkg.add("Chain")`
using Chain
@chain df1 begin
    leftjoin(df2, on = [:Student, :Class])
    subset(:Status => s -> .!(s .== "Inactive"))
    select(names(df1))
end

is equivalent to:

tmp1 = leftjoin(df1, df2, on = [:Student, :Class])
tmp2 = subset(tmp1, :Status => s -> .!(s .== "Inactive"))
select(tmp2, names(df1))

mocalvao · December 23, 2023, 2:38pm

Dear Nils,
Despite your being first in answering, I will choose Dan’s answer for solution because it is a little bit more explanatory and self-sufficient , for a beginner such as me (I tried your code block as is, just to notice that I had to install the Chain.jl package; not such a big deal, anyway…). Thanks

mocalvao · December 23, 2023, 2:39pm

Good explanation, Dan!

CeterisPartybus · December 24, 2023, 10:26am

You can also do this without any additional package in one line:

rightjoin(df1,df2[df2.Status .== "Active", Not(:Status)], on = [:Student, :Class])

I just pre-select rows in df2 that are “Active” and then join the two data frames. I use rightjoin to only select rows that are also present in df2 and disregard all rows in df1 that do not have a matching column in df2. The “Not(:Status)” tells Julia to take all columns but the Status column.

The or in a full self-contained version:

# Load Package
using DataFrames

# Input data frames
df1=DataFrame(Student=["Abel", "Beatriz", "Beatriz", "Caio", "Dario", "Dario"],Class=["12", "12", "10", "11", "13", "14"], Professor=["John", "John", "Jack", "Moe", "Heather", "Lisa"], P1=[7.8, 4.2, 6.9, 3.0, 1.3, 8.2])
df2=DataFrame(Student=["Abel", "Beatriz", "Caio", "Beatriz", "Dario", "Dario"],Class=["12", "12", "11", "10", "13", "14"], Status=["Active
", "Active", "Active", "Inactive", "Inactive", "Active"])

# Generate new data frame
df1_cleaned = rightjoin(df1,df2[df2.Status .== "Active", Not(:Status)], on = [:Student, :Class])

rocco_sprmnt21 · December 24, 2023, 12:07pm

a slightly different way


except=Tables.rows(df2[df2.Status .== "Inactive",[1,2]])

all=Tables.rows(df1[:,[1,2]])

delete!(df1,indexin(except,all))
#or 
df1[Not(indexin(except,all)),:]

Topic		Replies	Views
Dataframes: How to conditionally remove rows based on data values? General Usage	4	8226	October 1, 2020
Differences in Two DataFrames General Usage dataframes	5	898	November 11, 2022
How to create a new DF with only the rows that contain specific values from two different columns New to Julia question	5	89	September 15, 2024
Subset a dataframe by column of another dataframe Data question	4	2216	March 1, 2021
Delete row(s) of a dataframe based on dates in a vector New to Julia	2	221	September 23, 2022

Discarding rows of dataframe base on condition on column of another dataframe

Related topics