Find unique rows in a DataFrame


#1

Hi All,

I want to find the unique rows in a DataFrame and count how many times each one is repeated. I can get that with this:

t=DataFrame(a=rand(1:5,20),b=rand([:x,:y,:z],20))
ut=unique(t)

Int[countnz([t[j,:]==ut[i,:] for j in 1:size(t,1)]) for i in 1:size(ut,1)]

but I am wondering if there is a cleaner way, such as

Int[countnz(t.==r) for r in t]
or
Int[countnz(t.==r) for r in eachrow(t)]

Right now neither of the above works, but broadcasting each row and comparing seemed like the natural approach to me.

Any thoughts?


#2

You can do this with a grouped operation:

# you can group with a vector of symbols, so use all names in the DataFrame
by(t, names(t)) do d # a do block is an easy way to pass an anonymous function; d is the sub-DataFrame for one group
       DataFrame(m = length(d[:a])) # the group size; any column works for this
end

With DataFramesMeta and Lazy, which is closest to R’s chaining imo, you can do

using DataFramesMeta, Lazy
t = @> t begin
    @by(names(t), n_copies = length(:a))
end
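
For anyone reading this on a current setup: `by` was removed in DataFrames.jl 1.0 and `countnz` is gone from Julia 1.x, so the snippets above need translating. A minimal sketch of the same grouped count, assuming DataFrames.jl 1.x and using a small fixed table (made up for illustration) instead of random data:

```julia
using DataFrames

# Small fixed table so the result is reproducible (illustrative data only).
t = DataFrame(a = [1, 2, 1, 3, 2, 1], b = [:x, :y, :x, :z, :y, :z])

# Group on every column, then count the rows in each group.
# `nrow => :n` stores the group size in a new column named `n`.
counts = combine(groupby(t, names(t)), nrow => :n)
```

Each row of `counts` is one distinct row of `t` plus its number of occurrences. The order of the groups is not something to rely on unless you sort the result.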

#3

@genauguy thanks for the tip, split-apply-combine is more elegant.


#4

@genauguy
btw, how do I get the row indices for each group in the grouped DataFrame, so that I can index back into the original ungrouped DataFrame? Something like this:

[find([t[j,:]==ut[i,:] for j in 1:size(t,1)]) for i in 1:size(ut,1)]

#5

Comparing rows across DataFrames is less straightforward than I thought, but it might be easier on master, I’m not sure.
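
To make the row comparison concrete, here is a minimal sketch assuming Julia 1.x and DataFrames.jl 1.x (where `find` has become `findall`). It sidesteps DataFrame-to-DataFrame comparison by converting each row to a `NamedTuple`, which compares with plain `==`. The table is made-up illustrative data.

```julia
using DataFrames

t = DataFrame(a = [1, 2, 1, 3, 2, 1], b = [:x, :y, :x, :z, :y, :z])

# Convert each row to a NamedTuple so rows can be compared with `==`.
rows = NamedTuple.(eachrow(t))

# Distinct rows, in order of first appearance.
urows = unique(rows)

# For every distinct row, collect the positions in `t` where it occurs.
places = [findall(==(r), rows) for r in urows]
```

This scans all rows once per distinct row, so it is quadratic in the worst case. That is fine for small tables, but a grouped operation scales better.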

Here’s the DataFramesMeta way:

t[:rownum] = [i for i in 1:nrow(t)]
ut = @> t begin
     @by([:a,:b], places = [[i for i in :rownum]])
end

#6

@genauguy thanks, I did it this way:

cols=names(t)
t[:I] = 1:nrow(t)
by(t, cols, g->DataFrame(n=size(g,1), i=[g[:I]]))
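
The same idea in current syntax, as a sketch only: it assumes DataFrames.jl 1.x, where `by` is replaced by `groupby` plus `combine`. The column names `rownum`, `n` and `i` are just illustrative, and the data is made up.

```julia
using DataFrames

t = DataFrame(a = [1, 2, 1, 3, 2, 1], b = [:x, :y, :x, :z, :y, :z])

cols = names(t)                    # grouping columns, captured before adding the index
t.rownum = collect(1:nrow(t))      # remember each row's original position

# One output row per group: its size and the original row numbers.
# Wrapping the vector in `Ref` keeps it as a single cell instead of
# expanding it into several rows.
res = combine(groupby(t, cols),
              nrow => :n,
              :rownum => (x -> Ref(collect(x))) => :i)
```

Each entry of `res.i` can then be used to index back into the original table, e.g. `t[res.i[1], :]`.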