R function `duplicated`

_stla · December 15, 2022, 5:29pm

Hello,

For a vector v, the R function duplicated works as follows. The vector duplicated(v) has the same length as v, and its i-th element is false if and only if v[i] is the first occurrence of v[i] in v. For example, duplicated([1, 2, 1, 3, 2]) = [false, false, true, false, true]. I implemented it as follows in Julia:

function duplicated(x)
    out = fill(false, length(x))
    for i in 1:(length(x)-1)
        if !out[i]
            out[i .+ findall(x[i] .== x[(i+1):length(x)])] .= true
        end
    end
    return out
end

Can we improve it?

jling · December 15, 2022, 5:35pm

this feels like an intermediate masking kind of thing that’s more useful for R/Python than Julia, what are you gonna do with this vector?

Jeff_Emanuel · December 15, 2022, 5:35pm

Something like this (untested code) will make it run in O(n log n) instead of O(n^2):

function duplicated(x)
    out = fill(false, length(x))
    seen = Set{eltype(x)}()
    for (i,value) in enumerate(x)
        out[i] = value in seen
        push!(seen,value)
    end
    return out
end
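As a quick check, the set-based version can be run on the example from the first post. This is only a sketch: the name duplicated_set is chosen here for illustration, and the mask is negated with the broadcast form `.!`, since `!` is not defined for a Bool vector in Julia.

```julia
# Set-based variant: an element is a duplicate if its value was seen earlier.
function duplicated_set(x)
    out = fill(false, length(x))
    seen = Set{eltype(x)}()
    for (i, value) in enumerate(x)
        out[i] = value in seen   # true only if value already appeared
        push!(seen, value)       # no-op if value is already in the set
    end
    return out
end

v = [1, 2, 1, 3, 2]
mask = duplicated_set(v)   # Bool[0, 0, 1, 0, 1]
firsts = v[.!mask]         # [1, 2, 3], same as unique(v)

# The same mask can select from a second vector of equal length.
labels = ["a", "b", "c", "d", "e"]
kept = labels[.!mask]      # ["a", "b", "d"]
```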

_stla · December 15, 2022, 5:44pm

To remove the duplicates of a vector you can do v[.!duplicated(v)]. In that case this is equivalent to unique(v), but the same mask can be applied to another vector: x[.!duplicated(v)] (useful, for example, with v = score.(x) for some function score).

_stla · December 15, 2022, 5:46pm

Thanks. Are you sure it is better? With my function, elements marked as duplicates are not tested a second time in the next tests.

jling · December 15, 2022, 5:53pm

unique(score, x)

_stla · December 15, 2022, 5:55pm

Didn’t know that, thanks. But I use it for removing the rows of a matrix which have the same “score”: x[.!duplicated([score(row) for row in eachrow(x)]), :].

_stla · December 15, 2022, 6:02pm

Ah I see, I could use unique(score, collect(eachrow(x))) instead and reconstruct a matrix from these rows. But this should be less efficient, no?

stevengj · December 15, 2022, 6:23pm

Couldn’t you do

i = unique(j -> v[j], eachindex(v))
v[i]

for this purpose? This gives you an index array that you could re-use to extract corresponding slices of other arrays too.
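Applied to the matrix-rows case raised earlier in the thread, the index-based idea might look like the sketch below. The matrix x and the function score (here a row sum) are made-up illustrations, not taken from the thread.

```julia
# Hypothetical data: rows 1 to 4 all have score 3, row 5 has score 10.
x = [1 2; 3 0; 0 3; 2 1; 5 5]
score(row) = sum(row)

scores = [score(row) for row in eachrow(x)]

# unique(f, itr) keeps the first element of itr for each distinct value of f,
# so this returns the row indices of the first occurrence of each score.
keep = unique(j -> scores[j], eachindex(scores))   # [1, 5]

# The index vector can be reused to slice x or any other aligned array.
deduped = x[keep, :]                               # [1 2; 5 5]
```

Because keep holds indices rather than values, no intermediate Bool mask is needed and the rows are extracted with a single indexing operation.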

See also the discussions at “Return index vectors from unique” (Issue #1845, JuliaLang/julia on GitHub) and “Is there a function similar to numpy unique with inverse?” (post #6 by stevengj).

jling · December 15, 2022, 6:50pm

unique!.(score, eachrow(x))

this modifies x in-place instead of making copies

bkamins · December 15, 2022, 7:37pm

In DataFrames.jl you can do:

julia> using DataFrames

julia> df = DataFrame(x = rand(1:10^6, 10^6));

julia> @time nonunique(df, :x);
  0.034752 seconds (52 allocations: 24.585 MiB)

(which gives you a Bool vector exactly as you want and is slightly faster than the unique version that @stevengj proposed)

_stla · December 15, 2022, 7:37pm

Yes, I finally had the same idea. More generally unique(j -> score(v[j]), eachindex(v)).

skleinbo · December 16, 2022, 5:57am

Wouldn’t it rather be O(n·k), with n the length of the input and k the number of its unique elements? That is O(n^2) if all elements are unique.

Jeff_Emanuel · December 16, 2022, 6:34am

You have n to loop over the values. For each value you do a set lookup and a possible set insertion (and setting the boolean value in out, which is clearly constant time). I mentioned log n thinking that a set might be a binary tree. If instead a set is implemented as a hash (likely), then insertion and lookup could be constant time, so it’s possibly (probably) just O(n). I’m not at a computer so it’s not convenient to look up the set implementation details.
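The two versions from this thread can also be compared empirically. The following is a rough sketch only: the names duplicated_scan and duplicated_hash are chosen here for illustration, the input size is arbitrary, and timings vary by machine, so no numbers are shown.

```julia
# Scanning version from the first post: for each first occurrence,
# compare against the rest of the vector (about n*k comparisons in total).
function duplicated_scan(x)
    out = fill(false, length(x))
    for i in 1:(length(x) - 1)
        if !out[i]
            out[i .+ findall(x[i] .== x[(i + 1):length(x)])] .= true
        end
    end
    return out
end

# Set-based version: one hash lookup and one insertion per element.
function duplicated_hash(x)
    out = fill(false, length(x))
    seen = Set{eltype(x)}()
    for (i, value) in enumerate(x)
        out[i] = value in seen
        push!(seen, value)
    end
    return out
end

v = rand(1:5000, 5000)   # many distinct values, the hard case for the scan

# Both must agree; this call also compiles the functions before timing.
agree = duplicated_scan(v) == duplicated_hash(v)

t_scan = @elapsed duplicated_scan(v)
t_hash = @elapsed duplicated_hash(v)
println("scan: ", t_scan, " s, hash: ", t_hash, " s")
```

For more careful measurements, a benchmarking package would be preferable to a single @elapsed call.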

skleinbo · December 16, 2022, 7:02am

Ah, you’re of course totally right. Sorry, that wasn’t very clever of me 🤦‍♂️. It is a hash map by the way, so O(n) it is!
