Delete row from DataFrame in place based on entire row value

phantom · April 2, 2023, 1:59am

I am trying to write a function that will efficiently delete the rows in a large DataFrame if it exists in a separate larger DataFrame.

So the function look something like…

function removedup!(Tables.namedtupleiterator(NewDF), Tables.namedtupleiterator(dataDF))
    for row in NewDF 
        in(row, dataDF) && # delete row from NewDF
    end
end

But I can’t quite figure out how to do this in place. Based on other posts I think for a DataFrame alone it would look something like?

function removedup!(NewDF, dataDF)
    for row in eachrow(NewDF) 
        in(row, eachrow(dataDF)) && deleteat!(NewDF, findall(NewDF.col1 .== row.col1 .&& NewDF.col2 .== row.col2 .&& NewDF.col3 .== row.col3))
    end
end

But as the DataFrame is large I think passing the argument as Tables.namedtupleiterator improves row iteration efficiency. Also I don’t know if there is a way to do it by passing the entire row without listing out each individual column? My understanding is that DataFrames doesn’t have row indexing so I don’t know if that would work.

manuka · April 2, 2023, 11:07am

I am not an expert so my answer might not be that efficient. One way is to create a unique key for each row (e.g., concatenate all the columns separated by a unique character) for both datasets and then simply use the filter function.

kevbonham · April 2, 2023, 11:32am

In terms of deletion, I think your best bet is to get all of the row indices that are duplicates, and then call deleteat! once.

But in terms of efficiency, if DataDF is sufficiently large, I would guess that most of your time is spent searching it row by row. The way you’re doing it now, you basically have to iterate through every row of DataDF for each row of NewDF.

Are there any columns that are most likely to be unique? Maybe you could do an interactive approach where you check each column and bail of any of them are unique. An alternative is if there are one or more columns that are repeated, you could groupby those, and then index into the grouped data frame to reduce your search space.

bkamins · April 2, 2023, 1:26pm

you could do leftjoin! of both tables (assuming larger table does not have duplicates). This should be efficient.

rocco_sprmnt21 · April 2, 2023, 5:12pm

I would have said this is a task for antijoin

bkamins · April 2, 2023, 9:10pm

Indeed. It is a better solution. It will not be in-place but I assume that OP would accept this.

phantom · April 3, 2023, 11:21pm

awesome! Thanks for all the helpful approaches and useful suggestions. The DataFrame being used as a reference is a large Arrow file. Am I correct in assuming that I can avoid bringing the arrow table into memory with the antijoin approach?

The following seems to work but just wanted to make sure I wasn’t missing anything.

DataDF= DataFrame(Arrow.Table("file"))
NewDF = antijoin(NewDF, DataDF, on = intersect(names(NewDF), names(DataDF))

kevbonham · April 4, 2023, 12:02am

Oh wow, I didn’t know antijoin was a thing. Neat!

Topic		Replies	Views
JuliaDB delete row from table New to Julia juliadb	4	902	April 15, 2019
Delete duplicate rows in a DataFrame New to Julia dataframes	10	6013	June 22, 2023
Delete Id from dataframe if variable of id contains particular value General Usage dataframes	6	515	February 10, 2022
Delete all rows contained in a dataframe, as specified by an array of ids New to Julia	3	322	March 10, 2021
Delete rows in DataFrame Conditionally General Usage dataframes	4	1610	February 18, 2020

Delete row from DataFrame in place based on entire row value

Related topics