Delete duplicate rows in a DataFrame

In a Stack Overflow post about identifying duplicate rows in a DataFrame [Stackoverflow Post], Dominykas Mostauskis made a side comment about how to remove the rows identified. Unfortunately, the comment wasn't very instructive for beginners.

So, how exactly does one go about permanently removing the identified duplicate rows from a DataFrame?


Assuming you don't actually want to identify which rows are duplicates and are therefore removed, you can just use unique:


julia> using DataFrames

julia> df = DataFrame(a = rand(1:3, 10), b = rand(1:3, 10))
10×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     3      1
   3 │     2      1
   4 │     1      3
   5 │     1      1
   6 │     1      1
   7 │     3      1
   8 │     2      2
   9 │     3      1
  10 │     1      2

julia> unique(df)
6×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     3      1
   3 │     2      1
   4 │     1      1
   5 │     2      2
   6 │     1      2

Is there a way to get the non-unique rows as a DataFrame? Thank you.

I have found a way to do this:

df2 = transform(df, nonunique)  # adds a Bool column :x1 marking duplicate rows
df3 = filter(r -> r.x1, df2)    # keep only the rows flagged as duplicates

df3 gives a DataFrame of the non-unique rows.
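
As a side note, the same rows can be had more directly by indexing with the Bool vector that nonunique returns. A minimal sketch (variable names are my own, not from the thread):

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 4], b = [10, 10, 25, 35])

# nonunique(df) marks each row that duplicates an earlier row
dupmask = nonunique(df)

# index with the mask to get the duplicated rows themselves
dup_rows = df[dupmask, :]
```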


This will give you exactly one instance of each duplicated row. So you won’t know if a row is duplicated 5 times or 500 times. This may or may not be what you want, though.
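
If you do need the multiplicity, one hedged sketch is to attach a per-group row count and keep the rows whose combination occurs more than once (the column name :n is my own choice, not from the thread):

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 4], b = [10, 10, 25, 35])

# count how often each distinct row occurs
counts = combine(groupby(df, names(df)), nrow => :n)

# attach the count to every row, then keep rows occurring more than once
withn = leftjoin(df, counts, on = names(df))
dups  = subset(withn, :n => ByRow(>(1)))
```

Every row of dups carries its group's count in :n, so you can see whether a row is duplicated 5 times or 500 times.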


Is there a way to get the whole rows as a DataFrame?

This might be very slow for data sets with lots of columns, but you could try this:

julia> df = DataFrame(a = [1, 1, 2, 4], b = [10, 10, 25, 35]);

julia> combine(groupby(df, :)) do sdf
           if nrow(sdf) == 1 
               DataFrame()
           else
               sdf
           end
       end
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1     10
   2 │     1     10

Thank you @pdeffebach

Is there a way to keep either the first or last duplicate? Like Python’s pandas drop_duplicates:

keep : {'first', 'last', False}, default 'first'

Determines which duplicates (if any) to keep.
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.

df = DataFrame(a = [1, 1, 2, 4], b = [10, 10, 25, 35], c = [2,3,5,7])

# Keep only the first or last occurrence of each duplicated group on `cols`
# (note: rows without duplicates are dropped entirely)
function dropduplicates(df, cols; keep = :first)
    keep in [:first, :last] || error("keep parameter should be :first or :last")
    combine(groupby(df, cols)) do sdf
        if nrow(sdf) == 1
            DataFrame()  # singleton group: drop it
        else
            DataFrame(
              filter(
                r -> rownumber(r) == (keep == :first ? 1 : nrow(sdf)),
                eachrow(sdf)
              )
            )
        end
    end
end

Giving:

julia> df
4×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     10      2
   2 │     1     10      3
   3 │     2     25      5
   4 │     4     35      7

julia> dropduplicates(df, [:a, :b])
1×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     10      2

julia> dropduplicates(df, [:a, :b]; keep = :last)
1×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     10      3

(maybe this is not the most efficient way)
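
For pandas-style semantics, where rows without any duplicate are kept as well, a simpler sketch is to rely on unique(df, cols) keeping the first occurrence, and to reverse the rows to keep the last one instead:

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 4], b = [10, 10, 25, 35], c = [2, 3, 5, 7])

# keep the first occurrence of each (a, b) combination
first_kept = unique(df, [:a, :b])

# keep the last occurrence: reverse the rows, take firsts, reverse back
last_kept = unique(df[end:-1:1, :], [:a, :b])[end:-1:1, :]
```

Unlike the dropduplicates function above, this keeps the rows (2, 25, 5) and (4, 35, 7) too, matching what pandas drop_duplicates does.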


Just to add to that: unique() meanwhile also supports cols and keep keyword arguments, and it is very fast, as it can create a view of the rows.
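
For reference, with a sufficiently recent DataFrames.jl (the keep keyword arrived in a later 1.x release, so this assumes an up-to-date package), that looks like:

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 4], b = [10, 10, 25, 35], c = [2, 3, 5, 7])

unique(df, [:a, :b])                        # keep the first duplicate
unique(df, [:a, :b]; keep = :last)          # keep the last duplicate
unique(df, [:a, :b]; keep = :noduplicates)  # drop duplicated rows entirely
```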
