Find unique rows in a DataFrame


Hi All,

I am finding the unique rows of a DataFrame and counting how many times each one is repeated. With t as the DataFrame and ut as its unique rows, I can get the counts like this:


Int[countnz([t[j,:]==ut[i,:] for j in 1:size(t,1)]) for i in 1:size(ut,1)]

but I am wondering if there is a cleaner way, such as

Int[countnz(t.==r) for r in t]
Int[countnz(t.==r) for r in eachrow(t)]

Right now neither of these works, but broadcasting each row and comparing seemed like the natural approach to me.

Any thoughts?


You can do this with a grouped operation:

# you can group with a vector of symbols, so use all names in the DataFrame
by(t, names(t)) do d  # do-block: an easy way to pass an anonymous function; d is each sub-DataFrame
    DataFrame(m = length(d[:a]))  # just choose any column to count the rows
end

With DataFramesMeta and Lazy, which is closest to R’s chaining imo, you can do

using DataFramesMeta, Lazy
t = @> t begin
    @by(names(t), n_copies = length(:a))
end
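
For readers on a recent DataFrames.jl, where `by` has been removed, the same grouped count can be sketched roughly as below. This assumes DataFrames 1.x; the example data and the column names `a` and `b` are made up for illustration and are not from the original post.

```julia
using DataFrames

# Hypothetical example data; columns :a and :b are illustrative only.
t = DataFrame(a = [1, 1, 2, 1], b = ["x", "x", "y", "z"])

# Group on every column, then count the rows in each group.
# Each distinct row of t becomes one row of counts, with its multiplicity in :n_copies.
counts = combine(groupby(t, names(t)), nrow => :n_copies)
```

The order of the groups in `counts` is not something to rely on, so sort the result explicitly if order matters.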


@genauguy thanks for the tip, split-apply-combine is more elegant.


Btw, how do I get the row indices for each group in the grouped DataFrame, so that I can index back into the original ungrouped DataFrame? Something like this:

[find([t[j,:]==ut[i,:] for j in 1:size(t,1)]) for i in 1:size(ut,1)]


Comparing rows across DataFrames is less straightforward than I thought, but it might be easier on master, I’m not sure.

Here’s the DataFramesMeta way:

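t[:rownum] = collect(1:nrow(t))  # number every row; iterating over nrow(t) alone yields a single value
ut = @> t begin
     @by([:a,:b], places = [[i for i in :rownum]])  # wrap in an outer vector so each group keeps one list of row numbers
end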


@genauguy thanks, I did it this way:

t[:i] = 1:nrow(t)  # row index column; the name must match the g[:i] lookup below
by(t, cols, g -> DataFrame(n = size(g, 1), i = [g[:i]]))
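
A rough equivalent for recent DataFrames.jl, in case it helps someone landing here later. This is only a sketch assuming DataFrames 1.x; the data, the grouping columns `a` and `b`, and the output names `n` and `places` are illustrative choices, not part of the snippets above.

```julia
using DataFrames

# Hypothetical example data; columns :a and :b are illustrative only.
t = DataFrame(a = [1, 1, 2, 1], b = ["x", "x", "y", "x"])

# Record each row's position so it survives the grouping.
t.rownum = collect(1:nrow(t))

# One output row per group: the group size and the original row numbers.
# Wrapping the vector in Ref keeps it as a single cell instead of expanding it into rows.
ut = combine(groupby(t, [:a, :b]),
             nrow => :n,
             :rownum => (r -> Ref(collect(r))) => :places)
```

With this, `t[ut.places[k], :]` should return the rows of `t` that belong to the k-th row of `ut`.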