I am trying to write a function that will efficiently delete the rows in a large DataFrame if it exists in a separate larger DataFrame.
So the function look something like…
function removedup!(Tables.namedtupleiterator(NewDF), Tables.namedtupleiterator(dataDF))
for row in NewDF
in(row, dataDF) && # delete row from NewDF
But I can’t quite figure out how to do this in place. Based on other posts I think for a DataFrame alone it would look something like?
function removedup!(NewDF, dataDF)
for row in eachrow(NewDF)
in(row, eachrow(dataDF)) && deleteat!(NewDF, findall(NewDF.col1 .== row.col1 .&& NewDF.col2 .== row.col2 .&& NewDF.col3 .== row.col3))
But as the DataFrame is large I think passing the argument as Tables.namedtupleiterator improves row iteration efficiency. Also I don’t know if there is a way to do it by passing the entire row without listing out each individual column? My understanding is that DataFrames doesn’t have row indexing so I don’t know if that would work.
I am not an expert so my answer might not be that efficient. One way is to create a unique key for each row (e.g., concatenate all the columns separated by a unique character) for both datasets and then simply use the filter function.
In terms of deletion, I think your best bet is to get all of the row indices that are duplicates, and then call deleteat! once.
But in terms of efficiency, if DataDF is sufficiently large, I would guess that most of your time is spent searching it row by row. The way you’re doing it now, you basically have to iterate through every row of DataDF for each row of NewDF.
Are there any columns that are most likely to be unique? Maybe you could do an interactive approach where you check each column and bail of any of them are unique. An alternative is if there are one or more columns that are repeated, you could groupby those, and then index into the grouped data frame to reduce your search space.
awesome! Thanks for all the helpful approaches and useful suggestions. The DataFrame being used as a reference is a large Arrow file. Am I correct in assuming that I can avoid bringing the arrow table into memory with the antijoin approach?
The following seems to work but just wanted to make sure I wasn’t missing anything.
NewDF = antijoin(NewDF, DataDF, on = intersect(names(NewDF), names(DataDF))