I am trying to write a function that will efficiently delete rows from a large DataFrame if they exist in a separate, larger DataFrame.
So the function would look something like:

```julia
function removedup!(NewDF, dataDF)
    for row in Tables.namedtupleiterator(NewDF)
        in(row, Tables.namedtupleiterator(dataDF)) && # delete row from NewDF
    end
end
```
But I can’t quite figure out how to do this in place. Based on other posts, I think for a DataFrame alone it would look something like this?
```julia
function removedup!(NewDF, dataDF)
    for row in eachrow(NewDF)
        in(row, eachrow(dataDF)) && deleteat!(NewDF,
            findall(NewDF.col1 .== row.col1 .&& NewDF.col2 .== row.col2 .&& NewDF.col3 .== row.col3))
    end
end
```
But as the DataFrame is large, I think passing the argument as `Tables.namedtupleiterator` improves row-iteration efficiency. Also, I don’t know if there is a way to do it by passing the entire row without listing out each individual column? My understanding is that DataFrames doesn’t have row indexing, so I don’t know if that would work.
I am not an expert so my answer might not be that efficient. One way is to create a unique key for each row (e.g., concatenate all the columns separated by a unique character) for both datasets and then simply use the filter function.
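A minimal sketch of that key-based idea, assuming hypothetical column names and using `'\x1f'` (a character very unlikely to appear in the data) as the separator:

```julia
using DataFrames

# Build a string key per row; copy(r) turns a DataFrameRow into a NamedTuple.
rowkey(r) = join(string.(values(copy(r))), '\x1f')

function removedup!(NewDF::DataFrame, dataDF::DataFrame)
    # One pass over the reference table to collect keys, then an
    # in-place filter over NewDF with O(1) Set lookups per row.
    seen = Set(rowkey(r) for r in eachrow(dataDF))
    filter!(r -> rowkey(r) ∉ seen, NewDF)
end
```

This avoids the nested row-by-row search, at the cost of building the `Set` of keys once.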
In terms of deletion, I think your best bet is to get all of the row indices that are duplicates, and then call `deleteat!` once.
But in terms of efficiency, if `DataDF` is sufficiently large, I would guess that most of your time is spent searching it row by row. The way you’re doing it now, you basically have to iterate through every row of `DataDF` for each row of `NewDF`.

Are there any columns that are most likely to be unique? Maybe you could do an iterative approach where you check each column and bail if any of them are unique. An alternative, if there are one or more columns that are repeated, is that you could `groupby` those, and then index into the grouped data frame to reduce your search space.
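A sketch of that `groupby` idea, using the hypothetical columns `:col1`, `:col2`, `:col3` from the question and assuming `:col1` is the repeated column (indexing a `GroupedDataFrame` by a `NamedTuple` key is a lookup rather than a scan):

```julia
using DataFrames

function removedup_grouped!(NewDF::DataFrame, dataDF::DataFrame)
    gd = groupby(dataDF, :col1)
    todelete = Int[]
    for (i, row) in enumerate(eachrow(NewDF))
        key = (col1 = row.col1,)
        # No matching group means the row cannot be a duplicate.
        haskey(gd, key) || continue
        sub = gd[key]   # only search within the matching group
        if any(r -> r.col2 == row.col2 && r.col3 == row.col3, eachrow(sub))
            push!(todelete, i)
        end
    end
    deleteat!(NewDF, todelete)  # one deletion at the end, as suggested above
end
```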
You could do a `leftjoin!` of both tables (assuming the larger table does not have duplicates). This should be efficient.
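One way to realize the join idea with a marker column, assuming the `source` keyword (renamed from `indicator` in newer DataFrames.jl releases), which records for each row whether it came from the left table, the right table, or both:

```julia
using DataFrames

new = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])
ref = DataFrame(a = [2], b = ["y"])  # stand-in for the larger reference table

# Join on every shared column; rows found only in `new` are tagged "left_only".
marked = leftjoin(new, ref, on = [:a, :b], source = :src)
result = select(filter(:src => ==("left_only"), marked), Not(:src))
```

Note this builds a new DataFrame rather than mutating `new` in place.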
I would have said this is a task for `antijoin`.
Indeed, it is a better solution. It will not be in-place, but I assume the OP would accept that.
Awesome! Thanks for all the helpful approaches and useful suggestions. The DataFrame being used as a reference is a large Arrow file. Am I correct in assuming that I can avoid bringing the Arrow table into memory with the antijoin approach? The following seems to work, but I just wanted to make sure I wasn’t missing anything:
```julia
NewDF = antijoin(NewDF, DataDF, on = intersect(names(NewDF), names(DataDF)))
```
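On the Arrow question: Arrow.jl memory-maps the file, so a sketch like the one below (hypothetical file path and columns; the tiny write step is only there to make it self-contained) keeps `DataDF` backed by the mapped Arrow columns via `copycols=false`. Note that `antijoin` itself still allocates a fresh result table:

```julia
using Arrow, DataFrames

# Write a tiny Arrow file just so the example can run end to end.
path = tempname() * ".arrow"
Arrow.write(path, DataFrame(a = [1, 2, 3], b = ["x", "y", "z"]))

# Arrow.Table memory-maps the file; copycols=false avoids copying the
# mapped columns into ordinary Vectors when wrapping them in a DataFrame.
DataDF = DataFrame(Arrow.Table(path); copycols = false)

NewDF = DataFrame(a = [2, 4], b = ["y", "w"])
NewDF = antijoin(NewDF, DataDF, on = intersect(names(NewDF), names(DataDF)))
```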
Oh wow, I didn’t know antijoin was a thing. Neat!