babaq
May 17, 2018, 12:53am
1
Hi All,
I am trying to find the unique rows in a DataFrame and count how many times each one is repeated. I can do it like this:
t=DataFrame(a=rand(1:5,20),b=rand([:x,:y,:z],20))
ut=unique(t)
Int[countnz([t[j,:]==ut[i,:] for j in 1:size(t,1)]) for i in 1:size(ut,1)]
but I am wondering if there is a cleaner way, such as
Int[countnz(t.==r) for r in t]
or
Int[countnz(t.==r) for r in eachrow(t)]
Right now, neither of the above works, but broadcasting each row and comparing seemed like the natural way to do it.
Any thoughts?
You can do this with a grouped operation
# you can group with a vector of symbols, so use all names in the DataFrame
by(t, names(t)) do d # do-block syntax is an easy way to pass an anonymous function; d is one group
DataFrame(m = length(d[:a])) # any column works, they all have the same length within a group
end
With DataFramesMeta and Lazy, which is closest to R’s chaining imo, you can do
using DataFramesMeta, Lazy
t = @> t begin
@by(names(t), n_copies = length(:a))
end
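For comparison, the same split-apply-combine idea can be sketched without any packages. This is only an illustration, not the DataFrames API: rows are modeled as tuples built from plain column vectors, and `count_rows` and the sample data are hypothetical names made up for the example.

```julia
# Sample columns standing in for t[:a] and t[:b] (made-up data).
a = [1, 2, 1, 3, 2, 1]
b = [:x, :y, :x, :z, :y, :y]

# Hypothetical helper: count how often each distinct row occurs.
# Each row is the tuple of values taken from every column at one position.
function count_rows(cols...)
    counts = Dict{Any,Int}()
    for row in zip(cols...)
        counts[row] = get(counts, row, 0) + 1
    end
    return counts
end

counts = count_rows(a, b)
```

The keys of `counts` play the role of `unique(t)`, and the values are the number of copies of each row.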
babaq
May 17, 2018, 4:25am
3
@pdeffebach thanks for the tip, split-apply-combine is more elegant.
babaq
May 17, 2018, 5:44am
4
@pdeffebach
btw, how do I get the indices of the rows in each group of a grouped DataFrame, so that I can index back into the original ungrouped DataFrame? Something like this:
[find([t[j,:]==ut[i,:] for j in 1:size(t,1)]) for i in 1:size(ut,1)]
Comparing rows across DataFrames is less straightforward than I thought, but it might be easier on master, I’m not sure.
Here’s the DataFramesMeta way:
t[:rownum] = collect(1:nrow(t)) # one row number per row; iterating over nrow(t) alone would yield a single value
ut = @> t begin
@by([:a,:b], places = [[i for i in :rownum]])
end
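The same bookkeeping can be sketched in plain Julia, which may help show what the grouped version is doing. Again this is an assumption-laden illustration rather than anything from DataFrames or DataFramesMeta: `row_indices` and the sample columns are hypothetical, and rows are modeled as tuples.

```julia
# Sample columns standing in for t[:a] and t[:b] (made-up data).
a = [1, 2, 1, 3, 2, 1]
b = [:x, :y, :x, :z, :y, :y]

# Hypothetical helper: map each distinct row to the positions where it occurs.
function row_indices(cols...)
    groups = Dict{Any,Vector{Int}}()
    for (i, row) in enumerate(zip(cols...))
        # get! creates an empty index list the first time a row is seen
        push!(get!(() -> Int[], groups, row), i)
    end
    return groups
end

groups = row_indices(a, b)
```

Each value in `groups` can be used directly to index back into the original columns, for example `a[groups[(1, :x)]]`.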
babaq
May 17, 2018, 9:29pm
6
@pdeffebach thanks, I did it this way:
cols=names(t)
t[:I] = 1:nrow(t)
by(t, cols, g -> DataFrame(n = size(g, 1), i = [g[:I]]))