Checking for unique rows in classification

cwallj · August 23, 2021, 7:22pm

Hi all!

I’m very new to Julia as well as coding. I’m wanting to prep my data for a classification task. I’m trying to write a function that will look through an object of type DataFrame and find any duplicate rows that have different values for the classification column (since it would be silly to try and classify those rows if the same inputs result in a different classification). I started to write a for loop that would check, but then I thought that wouldn’t work because there might be a row at index 3 and another at 30 that are (with the exception of the classification column) identical that the for loop wouldn’t catch (at least I think it wouldn’t…). I’ll attach what I wrote so far, but I’m not really sure where to go with this one… Any advice would be much appreciated!

function class_check(df)
    print("Enter name of classification column \n\n")
    class_name = readline()
    
    for i = 1:size(df, 2)
        for j = 1:size(df, 1)
            if df[j, :class_name] == df[j+1, :class_name]
                print("Error")
            end
        end
    end
    
end

pdeffebach · August 23, 2021, 8:10pm

One thing to note: :class_name and class_name are different things. The first is a Symbol literal (kind of like a String), so it always refers to a column literally named class_name. The second is the variable holding the String you get from readline().
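To make the difference concrete, here is a minimal sketch. The toy DataFrame and the value "label" (standing in for whatever readline() would return) are made up for illustration.

```julia
using DataFrames

# Toy data (hypothetical): one column is deliberately named "class_name".
df = DataFrame(class_name = ["x", "y"], label = ["cat", "dog"])

# Stand-in for the String returned by readline(); assumed to be "label" here.
class_name = "label"

# :class_name is a literal Symbol, so this selects the column called "class_name".
literal_col = df[:, :class_name]

# The variable holds a String, which DataFrames accepts directly as a column name.
chosen_col = df[:, class_name]

# Symbol(class_name) converts the String to the Symbol :label if a Symbol is needed.
same_col = df[:, Symbol(class_name)]
```

So inside the function, indexing with class_name (no colon) is probably what was intended.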

nilshg · August 23, 2021, 10:21pm

I’m not sure I agree with your premise - it’s not uncommon in regression and classification tasks for the same covariates to produce different outcomes, as in general you wouldn’t expect to perfectly observe all covariates (and hence have no error in the outcome).

In any case, it sounds like you might be interested in a groupby operation on all your covariates, something like:

julia> df = DataFrame(y = [1, 2, 3, 4, 5], x1 = ["a", "b", "c", "c", "e"], x2 = ["f", "g", "d", "d", "e"])
5×3 DataFrame
 Row │ y      x1      x2     
     │ Int64  String  String 
─────┼───────────────────────
   1 │     1  a       f
   2 │     2  b       g
   3 │     3  c       d
   4 │     4  c       d
   5 │     5  e       e

julia> combine(groupby(df, [:x1, :x2]), :y => Ref => :y)
4×3 DataFrame
 Row │ x1      x2      y         
     │ String  String  SubArray… 
─────┼───────────────────────────
   1 │ a       f       [1]
   2 │ b       g       [2]
   3 │ c       d       [3, 4]
   4 │ e       e       [5]

When doing this you have to decide what to do with the multiple different outcomes (you could replace Ref by e.g. first to keep the first observed y, or mean to get the average, or whatever other function is appropriate in your case).
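As a rough sketch of how this could answer the original question, the same groupby can count the distinct y values per group and keep only the groups with more than one. The column names n_classes and the variable names below are my own choices, not anything from DataFrames itself.

```julia
using DataFrames

df = DataFrame(y = [1, 2, 3, 4, 5],
               x1 = ["a", "b", "c", "c", "e"],
               x2 = ["f", "g", "d", "d", "e"])

gdf = groupby(df, [:x1, :x2])

# Count the distinct y values within each group of identical covariates.
counts = combine(gdf, :y => (v -> length(unique(v))) => :n_classes)

# Keep only the covariate combinations that map to more than one y.
conflicts = filter(:n_classes => >(1), counts)

# Alternative: resolve each group by keeping the first observed y.
first_y = combine(gdf, :y => first => :y)
```

Here conflicts should contain only the ("c", "d") combination, since it is the only one that appears with two different y values.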

ayodyas · August 11, 2022, 1:40pm

Is there any way to get the number of repetitions when the from and to values are the same pair, just in the opposite order?
df = DataFrame(y = [1, 2, 3, 4, 5], x1 = ["a", "b", "c", "d", "e"], x2 = ["f", "g", "d", "c", "e"])

For example, can c to d and d to c be counted as repetitions?

bernhard · August 11, 2022, 2:35pm

You can use Set if you want to ignore the order.
I am sure there are more elegant solutions, but maybe this is what you are looking for:

using DataFrames
df = DataFrame(y = [1, 2, 3, 4, 5], x1 = ["a", "b", "c", "d", "e"], x2 = ["f", "g", "d", "c", "e"])
# string.(x1, x2) concatenates the two values; Set then keeps the characters without order
df.id = Set.(string.(df.x1,df.x2))
rs = combine(groupby(df, :id), :y => Ref => :y)
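One caveat: string.(df.x1, df.x2) concatenates the two values before Set is applied, so the key is a set of characters. That is fine for single-character values like these, but with longer strings the pairs ("ab", "c") and ("a", "bc") would end up with the same key. A possible variant, sketched here with my own column name pair, builds the Set from the two whole strings and uses nrow to count the repetitions:

```julia
using DataFrames

df = DataFrame(y = [1, 2, 3, 4, 5],
               x1 = ["a", "b", "c", "d", "e"],
               x2 = ["f", "g", "d", "c", "e"])

# Order-insensitive key made from the two whole strings rather than their characters.
df.pair = [Set((a, b)) for (a, b) in zip(df.x1, df.x2)]

# Number of rows sharing each unordered pair.
pair_counts = combine(groupby(df, :pair), nrow => :count)
```

With this data the pair of "c" and "d" should get a count of 2 and every other pair a count of 1.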
