Delete duplicate rows in a DataFrame

In a post on Stackoverflow about identifying duplicate rows in a DataFrame [Stackoverflow Post], Dominykas Mostauskis made a side-comment about how to remove the rows identified. Unfortunately, the comment wasn’t that instructive for beginners.

So, how exactly does one go about removing identified unique rows from a DataFrame permanently?

Assuming you don’t actually want to identify the rows which are duplicates and therefore removed, you can just use unique?


julia> using DataFrames

julia> df = DataFrame(a = rand(1:3, 10), b = rand(1:3, 10))
10Γ—2 DataFrame
 Row β”‚ a      b     
     β”‚ Int64  Int64 
─────┼──────────────
   1 β”‚     1      3
   2 β”‚     3      1
   3 β”‚     2      1
   4 β”‚     1      3
   5 β”‚     1      1
   6 β”‚     1      1
   7 β”‚     3      1
   8 β”‚     2      2
   9 β”‚     3      1
  10 β”‚     1      2

julia> unique(df)
6Γ—2 DataFrame
 Row β”‚ a      b     
     β”‚ Int64  Int64 
─────┼──────────────
   1 β”‚     1      3
   2 β”‚     3      1
   3 β”‚     2      1
   4 β”‚     1      1
   5 β”‚     2      2
   6 β”‚     1      2
5 Likes

is there a way to get non unique dataframe ? thank you.

i have found a way to do this.

df2 = transform(df,nonunique)
df3 = filter(r -> r.x1 != 0, df2)

df3 gives data frame of non unique rows

This will give you exactly one instance of each duplicated row. So you won’t know if a row is duplicated 5 times or 500 times. This may or may not be what you want, though.

1 Like

is there a way to get whole rows as dataframe ?

This might be very slow for data sets with lots of columns, but you could try this:

julia> df = DataFrame(a = [1, 1, 2, 4], b = [10, 10, 25, 35]);

julia> combine(groupby(df, :)) do sdf
           if nrow(sdf) == 1 
               DataFrame()
           else
               sdf
           end
       end
2Γ—2 DataFrame
 Row β”‚ a      b     
     β”‚ Int64  Int64 
─────┼──────────────
   1 β”‚     1     10
   2 β”‚     1     10
1 Like

thankyou @pdeffebach