Delete duplicate rows in a DataFrame

Nash · July 29, 2021, 10:06am

In a post on Stack Overflow about identifying duplicate rows in a DataFrame [Stack Overflow post], Dominykas Mostauskis made a side comment about how to remove the rows identified. Unfortunately, the comment wasn’t that instructive for beginners.

So, how exactly does one go about permanently removing the identified duplicate rows from a DataFrame?

nilshg · July 29, 2021, 10:10am

Assuming you don’t actually need to identify which rows are duplicates (and therefore removed), you can just use unique:


julia> using DataFrames

julia> df = DataFrame(a = rand(1:3, 10), b = rand(1:3, 10))
10×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      3
   2 │     3      1
   3 │     2      1
   4 │     1      3
   5 │     1      1
   6 │     1      1
   7 │     3      1
   8 │     2      2
   9 │     3      1
  10 │     1      2

julia> unique(df)
6×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      3
   2 │     3      1
   3 │     2      1
   4 │     1      1
   5 │     2      2
   6 │     1      2
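Since the question asks about removing the rows permanently: unique returns a new DataFrame and leaves df as it was. A minimal sketch of the in-place variant, unique!, on made-up data (the variable names are only for illustration):

```julia
using DataFrames

df = DataFrame(a = [1, 3, 2, 1, 1], b = [3, 1, 1, 3, 1])

# unique returns a new DataFrame and leaves df untouched
deduped = unique(df)
rows_before = nrow(df)

# unique! removes the duplicate rows from df itself, keeping first occurrences
unique!(df)
rows_after = nrow(df)
```

Alternatively, reassigning with df = unique(df) has the same visible effect on the variable df.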

sai_matcha · February 14, 2022, 5:50pm

Is there a way to get a DataFrame of the non-unique rows? Thank you.

sai_matcha · February 14, 2022, 6:26pm

I have found a way to do this:

df2 = transform(df, nonunique)      # adds a Bool column x1 that flags repeated rows
df3 = filter(r -> r.x1 != 0, df2)   # keep only the flagged rows

df3 is a DataFrame of the non-unique rows.
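A shorter route to the same rows, without the helper column x1, is to index with the Bool vector directly. This is only a sketch on made-up data; as far as I know, nonunique marks a row as true when it repeats an earlier row, so the first occurrence is not flagged:

```julia
using DataFrames

df = DataFrame(a = [1, 1, 1, 2], b = [10, 10, 10, 20])

flags = nonunique(df)   # true where a row repeats an earlier row
repeats = df[flags, :]  # the flagged rows, without an extra x1 column
```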

pdeffebach · February 14, 2022, 6:36pm

This returns every occurrence of a duplicated row except the first, so it neither shows the first copy nor tells you directly how many times each row occurs. This may or may not be what you want, though.
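If the number of occurrences matters, one option is to count the rows in each group. A minimal sketch, assuming all columns define a duplicate; the column name count and the example data are my own choices:

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 4], b = [10, 10, 25, 35])

# one row per distinct (a, b) combination, with the number of occurrences
counts = combine(groupby(df, names(df)), nrow => :count)

# keep only the combinations that occur more than once
duplicated = filter(:count => >(1), counts)
```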

sai_matcha · February 14, 2022, 6:43pm

Is there a way to get all of the duplicated rows, every occurrence, as a DataFrame?

pdeffebach · February 14, 2022, 6:49pm

This might be very slow for data sets with lots of columns, but you could try this:

julia> df = DataFrame(a = [1, 1, 2, 4], b = [10, 10, 25, 35]);

julia> combine(groupby(df, :)) do sdf
           if nrow(sdf) == 1 
               DataFrame()
           else
               sdf
           end
       end
2×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1     10
   2 │     1     10
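A possibly faster alternative that avoids grouping, sketched here under the assumption that DataFrames.jl 1.5 or later is installed (where nonunique accepts a keep keyword):

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 4], b = [10, 10, 25, 35])

# Assumes DataFrames.jl 1.5 or later: with keep = :noduplicates, every row
# that has a duplicate is flagged, including its first occurrence.
dups = df[nonunique(df; keep = :noduplicates), :]
```

On this example it should select the same two rows as the groupby version above.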

sai_matcha · February 14, 2022, 7:02pm

Thank you, @pdeffebach

bert · November 23, 2022, 1:25am

Is there a way to keep either the first or last duplicate? Like Python’s pandas drop_duplicates:

keep {‘first’, ‘last’, False}, default ‘first’

Determines which duplicates (if any) to keep.
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.

Dan · November 23, 2022, 2:30am

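using DataFrames

df = DataFrame(a = [1, 1, 2, 4], b = [10, 10, 25, 35], c = [2, 3, 5, 7])

# For each group of rows that share the same values in `cols`, keep only the
# first or last row. Groups with a single row (no duplicates) are dropped,
# which differs from pandas' drop_duplicates.
function dropduplicates(df, cols; keep = :first)
    keep in [:first, :last] || error("keep parameter should be :first or :last")
    combine(groupby(df, cols)) do sdf
        if nrow(sdf) == 1
            DataFrame()   # no duplicates in this group, so contribute no rows
        else
            # keep the row whose position within the group matches `keep`
            DataFrame(
              filter(
                r -> rownumber(r) == (keep == :first ? 1 : nrow(sdf)),
                eachrow(sdf)
              )
            )
        end
    end
end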

Giving:

julia> df
4×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1     10      2
   2 │     1     10      3
   3 │     2     25      5
   4 │     4     35      7

julia> dropduplicates(df, [:a, :b])
1×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1     10      2

julia> dropduplicates(df, [:a, :b]; keep = :last)
1×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1     10      3

(maybe this is not the most efficient way)
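If pandas-style behaviour is wanted, where rows without duplicates are kept as well, a shorter sketch is to take the first or last row of every group. keep_one_per_group is a hypothetical helper name, not part of DataFrames.jl:

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 4], b = [10, 10, 25, 35], c = [2, 3, 5, 7])

# keep_one_per_group is a hypothetical helper name, not part of DataFrames.jl.
# It keeps one row from every group, including groups that contain a single row.
function keep_one_per_group(df, cols; keep = :first)
    keep in (:first, :last) || throw(ArgumentError("keep should be :first or :last"))
    pick = keep == :first ? first : last
    # first/last applied to each group's SubDataFrame returns a single row
    return combine(pick, groupby(df, cols))
end

firsts = keep_one_per_group(df, [:a, :b])
lasts = keep_one_per_group(df, [:a, :b]; keep = :last)
```

Unlike dropduplicates above, this returns three rows for the example data, because the groups (2, 25) and (4, 35) are kept too.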

hhaensel · June 22, 2023, 6:11am

Just to add to that: unique() supports (perhaps only in more recent versions) a cols argument and a keep keyword, and it is very fast; passing view=true makes it return a view of the rows instead of a copy.
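A minimal sketch of those options on the example data from above, assuming DataFrames.jl 1.5 or later (the variable names are only for illustration):

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 4], b = [10, 10, 25, 35], c = [2, 3, 5, 7])

# Duplicates are judged on columns a and b only; column c is carried along.
first_kept = unique(df, [:a, :b])                        # keep = :first is the default
last_kept  = unique(df, [:a, :b]; keep = :last)          # keep the last occurrence instead
no_dups    = unique(df, [:a, :b]; keep = :noduplicates)  # drop every row that has a duplicate
as_view    = unique(df, [:a, :b]; view = true)           # SubDataFrame into df, no copy
```

Here keep = :noduplicates plays the role of keep=False in pandas' drop_duplicates.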
