In a post on Stack Overflow about identifying duplicate rows in a DataFrame [Stackoverflow Post], Dominykas Mostauskis made a side comment about how to remove the rows identified. Unfortunately, the comment wasn't that instructive for beginners.
So, how exactly does one go about removing identified unique rows from a DataFrame permanently?
This will give you exactly one instance of each duplicated row, so you won't know whether a row is duplicated 5 times or 500 times. This may or may not be what you want, though.
The keep parameter determines which duplicates (if any) to keep:
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
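The dropduplicates function in this post accepts only the first two options. For the third one, read here as "keep only the rows whose key columns occur exactly once", one possible sketch filters the groups of a grouped data frame by size. The name dropallduplicates is hypothetical and not part of DataFrames.jl; the example data frame is the same one used in the rest of the post.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): keep only the rows whose
# values in `cols` occur exactly once, i.e. drop every duplicated row.
function dropallduplicates(df, cols)
    gdf = groupby(df, cols)
    # Keep the groups of size one and flatten them back into a data frame.
    return filter(sdf -> nrow(sdf) == 1, gdf; ungroup = true)
end

df = DataFrame(a = [1, 1, 2, 4], b = [10, 10, 25, 35], c = [2, 3, 5, 7])
singles = dropallduplicates(df, [:a, :b])
```

For this df, the two rows with a = 1 and b = 10 are dropped, and the rows with a = 2 and a = 4 remain.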
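using DataFrames

df = DataFrame(a = [1, 1, 2, 4], b = [10, 10, 25, 35], c = [2, 3, 5, 7])

# Return one row for each group of rows that is duplicated on `cols`;
# rows whose `cols` values occur only once are left out of the result.
function dropduplicates(df, cols; keep = :first)
    keep in [:first, :last] || error("keep parameter should be :first or :last")
    combine(groupby(df, cols)) do sdf
        if nrow(sdf) == 1
            # Not a duplicate: contribute no rows for this group.
            DataFrame()
        else
            # Keep the first or last row of the group, depending on `keep`.
            DataFrame(
                filter(
                    r -> rownumber(r) == (keep == :first ? 1 : nrow(sdf)),
                    eachrow(sdf)
                )
            )
        end
    end
end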
Giving:
julia> df
4×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     10      2
   2 │     1     10      3
   3 │     2     25      5
   4 │     4     35      7

julia> dropduplicates(df, [:a, :b])
1×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     10      2

julia> dropduplicates(df, [:a, :b]; keep = :last)
1×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     10      3
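
As noted earlier, this result holds a single instance of each duplicated row and says nothing about how often it occurred. If the count matters, a minimal sketch is to group on the same columns and record the size of each group. The variable names counts and dupcounts and the column name count are illustrative choices, not anything defined earlier in the post.

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 4], b = [10, 10, 25, 35], c = [2, 3, 5, 7])

# One row per (a, b) combination, with the number of rows in each group.
counts = combine(groupby(df, [:a, :b]), nrow => :count)

# Keep only the combinations that occur more than once.
dupcounts = filter(:count => >(1), counts)
```

For this df, dupcounts has a single row: the combination a = 1, b = 10, with a count of 2.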