Checking for unique rows in classification

Hi all!

I’m very new to Julia and to coding in general. I want to prep my data for a classification task. I’m trying to write a function that looks through a DataFrame and finds any duplicate rows that have different values in the classification column (since it would be silly to try to classify those rows if the same inputs result in a different classification). I started to write a for loop that would check, but then I thought that wouldn’t work, because a row at index 3 and another at index 30 might be identical (apart from the classification column), and a loop that only compares neighbouring rows wouldn’t catch that (at least I think it wouldn’t…). I’ll attach what I wrote so far, but I’m not really sure where to go with this one… Any advice would be much appreciated!

function class_check(df)
    print("Enter name of classification column \n\n")
    class_name = readline()
    
    for i = 1:size(df, 2)
        for j = 1:size(df, 1)
            if df[j, :class_name] == df[j+1, :class_name]
                print("Error")
            end
        end
    end
    
end 

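One thing to note: :class_name and class_name are different things. The first is a Symbol literal, so df[j, :class_name] looks for a column that is literally called class_name. The second is a variable holding the String you get back from readline(). To index with whatever the user typed, use df[j, class_name] or df[j, Symbol(class_name)].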
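To make the Symbol vs. variable difference concrete, here is a small sketch. The DataFrame and the column name "label" are made up for illustration, and class_name is set by hand instead of coming from readline():

```julia
using DataFrames

# Hypothetical toy data; "label" plays the role of the classification column.
df = DataFrame(label = ["yes", "no"], x = [1, 2])

# Stands in for the String that readline() would return.
class_name = "label"

a = df[1, class_name]           # indexes the column named "label"
b = df[1, Symbol(class_name)]   # same lookup, with the String converted to a Symbol

# df[1, :class_name] would throw an error here, because no column is
# literally named class_name.
```

Both lookups return the same cell, so either form should work inside the loop.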

I’m not sure I agree with your premise - it’s not uncommon in regression and classification tasks for the same covariates to produce different outcomes, as in general you wouldn’t expect to perfectly observe all covariates (and hence have no error in the outcome).

In any case, it sounds like you might be interested in a groupby operation on all your covariates, something like:

julia> df = DataFrame(y = [1, 2, 3, 4, 5], x1 = ["a", "b", "c", "c", "e"], x2 = ["f", "g", "d", "d", "e"])
5×3 DataFrame
 Row │ y      x1      x2     
     │ Int64  String  String 
─────┼───────────────────────
   1 │     1  a       f
   2 │     2  b       g
   3 │     3  c       d
   4 │     4  c       d
   5 │     5  e       e

julia> combine(groupby(df, [:x1, :x2]), :y => Ref => :y)
4×3 DataFrame
 Row │ x1      x2      y         
     │ String  String  SubArray… 
─────┼───────────────────────────
   1 │ a       f       [1]
   2 │ b       g       [2]
   3 │ c       d       [3, 4]
   4 │ e       e       [5]

When doing this you have to decide what to do with the multiple different outcomes (you could replace Ref with e.g. first to keep the first observed y, or mean (after using Statistics) to get the average, or whatever other function is appropriate in your case).
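If the goal is only to flag the covariate combinations that come with more than one class, one possible sketch is to count the distinct y values per group and keep the groups where that count exceeds 1. The column name n_classes is just a name picked for this example:

```julia
using DataFrames

df = DataFrame(y = [1, 2, 3, 4, 5],
               x1 = ["a", "b", "c", "c", "e"],
               x2 = ["f", "g", "d", "d", "e"])

# Count the distinct y values seen for each combination of covariates.
per_group = combine(groupby(df, [:x1, :x2]),
                    :y => (v -> length(unique(v))) => :n_classes)

# Keep only the combinations that map to more than one class.
conflicts = filter(:n_classes => >(1), per_group)
```

Here conflicts should contain only the (c, d) combination, since that is the only pair of inputs that appears with two different y values.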


Is there any way to get the number of repetitions when the "from" and "to" values are the same pair but in swapped order?
df = DataFrame(y = [1, 2, 3, 4, 5], x1 = ["a", "b", "c", "d", "e"], x2 = ["f", "g", "d", "c", "e"])

For example, can c to d and d to c be counted as repetitions?

You can use a Set if you want to ignore the order.
I am sure there are more elegant solutions, but maybe this is what you are looking for:

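using DataFrames
df = DataFrame(y = [1, 2, 3, 4, 5], x1 = ["a", "b", "c", "d", "e"], x2 = ["f", "g", "d", "c", "e"])
# Build the Set from the pair itself: a Set of the concatenated string would be a set of characters
df.id = Set.(tuple.(df.x1, df.x2))
rs = combine(groupby(df, :id), :y => Ref => :y)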
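To get the number of repetitions rather than the grouped y values, one option is to count the rows per group with nrow. The sketch below also swaps the Set for minmax, which puts each pair in sorted order so that (c, d) and (d, c) end up with the same key. That is my substitution, not part of the answer above, and the column names pair and count are only illustrative:

```julia
using DataFrames

df = DataFrame(y = [1, 2, 3, 4, 5],
               x1 = ["a", "b", "c", "d", "e"],
               x2 = ["f", "g", "d", "c", "e"])

# minmax returns the two values as a sorted tuple, so the order of
# x1 and x2 no longer matters: ("d", "c") becomes ("c", "d").
df.pair = minmax.(df.x1, df.x2)

# nrow => :count gives the number of rows in each group.
counts = combine(groupby(df, :pair), nrow => :count)

# Pairs that occur more than once, in either direction.
repeated = filter(:count => >(1), counts)
```

With this data, repeated should hold a single row for the pair ("c", "d") with a count of 2.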