R function `duplicated`

Hello,

For a vector v, the R function duplicated works as follows: duplicated(v) has the same length as v, and its i-th element is false if and only if v[i] is the first occurrence of that value in v. For example, duplicated([1, 2, 1, 3, 2]) = [false, false, true, false, true]. I implemented it as follows in Julia:

function duplicated(x)
    out = fill(false, length(x))
    for i in 1:(length(x)-1)
        if !out[i]  # only elements not already marked as duplicates need checking
            # mark every later element equal to x[i] as a duplicate
            out[i .+ findall(x[i] .== x[(i+1):length(x)])] .= true
        end
    end
    return out
end

Can we improve it?

This feels like an intermediate masking kind of thing that’s more useful in R/Python than in Julia. What are you going to do with this vector?

1 Like

Something like this (untested code) will make it run in O(n log n) instead of O(n^2):
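
A minimal sketch along those lines, based on the Set lookup-and-insertion approach described further down in the thread (the name duplicated_set and the exact form are only illustrative):

# Track the values already seen in a Set; an element is a duplicate
# exactly when its value has been seen before.
function duplicated_set(x)
    seen = Set{eltype(x)}()
    out = fill(false, length(x))
    for (i, v) in pairs(x)
        if v in seen
            out[i] = true      # value seen before: mark as duplicate
        else
            push!(seen, v)     # first occurrence: remember it
        end
    end
    return out
end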

1 Like

To remove the duplicates of a vector you can do v[.!duplicated(v)]. Well, in this case that is equivalent to unique(v), but it can also be used to index another vector: x[.!duplicated(v)] (useful for example with v = score.(x) for some function score).
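
For illustration, with abs standing in for an arbitrary score function (the names and values here are just made up for the example):

score(t) = abs(t)       # hypothetical scoring function
x = [-2, 3, 2, -3, 5]
v = score.(x)           # [2, 3, 2, 3, 5]
x[.!duplicated(v)]      # [-2, 3, 5]: the first x for each distinct score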

Thanks. Are you sure it is better? With my function, elements already marked as duplicates are not tested again in later iterations.

unique(score, x)
1 Like

Didn’t know that, thanks. But I use it for removing the rows of a matrix which have the same “score”: x[.!duplicated([score(row) for row in eachrow(x)]), :].

Ah I see, I could use unique(score, collect(eachrow(x))) instead and reconstruct a matrix from these rows. But that should be less efficient, no?
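
One possible way to rebuild a matrix from those rows, as a rough sketch (it still copies the selected rows):

rows = unique(score, collect(eachrow(x)))   # one row view per distinct score
x2 = reduce(vcat, transpose.(rows))         # stack the row views back into a matrix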

Couldn’t you do

i = unique(j -> v[j], eachindex(v))
v[i]

for this purpose? This gives you an index array that you could re-use to extract corresponding slices of other arrays too.
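
Applied to the matrix-row case, that could look something like this (score here is only a stand-in, e.g. the row sum):

score(row) = sum(row)     # hypothetical scoring function
x = [1 2; 3 4; 1 2; 5 6]
keep = unique(i -> score(view(x, i, :)), axes(x, 1))   # index of the first row for each distinct score
x[keep, :]                # rows 1, 2 and 4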

See also the discussions at Return index vectors from unique · Issue #1845 · JuliaLang/julia · GitHub and
Is there a function similar to numpy unique with inverse? - #6 by stevengj

1 Like
unique!.(score, eachrow(x))

This modifies x in place instead of making copies.

In DataFrames.jl you can do:

julia> df = DataFrame(x = rand(1:10^6, 10^6));

julia> @time nonunique(df, :x);
  0.034752 seconds (52 allocations: 24.585 MiB)

(which gives you a Bool vector exactly as you want and is slightly faster than the unique version that @stevengj proposed)

Yes, I eventually had the same idea. More generally: unique(j -> score(v[j]), eachindex(v)).

Wouldn’t it rather be O(n·k), with n the length of the input and k the number of its unique elements? That is O(n^2) if all elements are unique.

You have n iterations to loop over the values. For each value you do a set lookup and possibly a set insertion (plus setting the boolean value in out, which is clearly constant time). I mentioned log n thinking that a set might be a binary tree. If instead the set is implemented as a hash table (likely), then insertion and lookup should be constant time on average, so it’s possibly (probably) just O(n). I’m not at a computer, so it’s not convenient to look up the set implementation details.

1 Like

Ah, you’re of course totally right. Sorry, that wasn’t very clever of me 🤦‍♂️. It is a hash map by the way, so O(n) it is!