Hi!
Is there a function that does the opposite of unique? Say
nonunique([1,2,3,4,3,5,3,2]) = [2,3]
Thanks a lot!
julia> using DataStructures
julia> nonunique(v) = [k for (k, v) in counter(v) if v > 1]
nonunique (generic function with 1 method)
julia> nonunique([1,2,3,4,3,5,3,2])
2-element Vector{Int64}:
2
3
Equivalently to the above, you can use countmap from StatsBase instead of counter from DataStructures:
using StatsBase
[k for (k, v) in countmap(v) if v > 1]
For fun, here is code for this without using any packages and with only one pass over the data.
This is kinda ugly because I am golfing it a bit:
function nonunique(v)
    seen = Dict{eltype(v), Ref{Int}}()
    [x for x in v if 2 == (get!(()->Ref(0), seen, x)[]+=1)]
end
which gives the following (still unsorted; you could sort afterwards):
julia> nonunique([1,2,3,4,3,5,3,2])
2-element Array{Int64,1}:
3
2
Another fun way: use sort, then diff to find things that occur after things that are the same as them, then unique to drop extra multiples:
function nonunique(v)
    sv = sort(v)
    return unique(@view sv[[diff(sv).==0; false]])
end
I’d probably use one of the packages and two passes though. Or a Dict and two passes.
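For reference, the Dict-and-two-passes idea could look roughly like the sketch below. The name nonunique_dict is made up for illustration, and this is just one plausible way to write it, not necessarily the version anyone in this thread had in mind.

```julia
# Sketch only: `nonunique_dict` is a hypothetical name, not a Base function.
function nonunique_dict(v)
    counts = Dict{eltype(v), Int}()
    # First pass: count how many times each value occurs in `v`.
    for x in v
        counts[x] = get(counts, x, 0) + 1
    end
    # Second pass (over the counts): keep the values that occur more than once.
    # Dict iteration order is unspecified, so the result is unsorted.
    return [k for (k, c) in counts if c > 1]
end
```

Wrap the call in sort if you need a deterministic order, e.g. sort(nonunique_dict([1,2,3,4,3,5,3,2])).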
That’s not quite the opposite of unique. Note that unique([1,2,3,4,3,5,3,2]) == [1,2,3,4,5], not [1,4,5].
Maybe a better name would be only_repeated_values or something like that.
@stillyslalom @CameronBieganek @oxinabox thanks for coding something for me! I was just asking if there was a built-in function, and didn’t expect you to write the solution.
I ended up using symdiff(v,unique(v)), which works for my specific case (no more than 2 of the same number, and I also need unique, so that is available for free).
Thanks again!
R calls this function duplicated. Rather, R’s duplicated() would flag the positions of the second 2 and the second and third 3 in Ribiero’s example. Then again, R was never known for having overly-descriptive function names.
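To make the comparison concrete, here is a rough Julia sketch of that behaviour: it marks every element that has already appeared earlier in the vector. The name duplicated_mask is hypothetical (it is not in Base), and the sketch assumes an ordinary 1-based Vector.

```julia
# Sketch only: `duplicated_mask` is a hypothetical name mimicking R's duplicated().
function duplicated_mask(v)
    seen = Set{eltype(v)}()
    mask = falses(length(v))      # BitVector, all false to start with
    for (i, x) in enumerate(v)
        if x in seen
            mask[i] = true        # x already occurred at an earlier position
        else
            push!(seen, x)
        end
    end
    return mask
end

v = [1,2,3,4,3,5,3,2]
mask = duplicated_mask(v)
findall(mask)      # positions of the repeats: [5, 7, 8]
unique(v[mask])    # the repeated values, in order of first repeat: [3, 2]
```

So the nonunique from the original question is unique(v[duplicated_mask(v)]), up to ordering.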
Using the beautiful Multisets.jl package:
using Multisets
v = [1,2,3,4,3,5,3,2]
M = Multiset(v)
U = Multiset(Set(M))
collect(keys(M-U))
2-element Vector{Int64}:
2
3
And another way:
using Multisets
v = [1,2,3,4,3,5,3,2]
M = Multiset(v)
collect(keys(M))[values(M).>1]
2-element Vector{Int64}:
2
3
On this topic, see also the fast solutions by Przemyslaw Szufel and Bogumił Kamiński on Stack Overflow.
NB! log y axis
Results:
9-element BenchmarkTools.BenchmarkGroup:
tags: []
"dict" => Trial(357.094 μs)
"symdiff" => Trial(473.081 μs)
"szufelinplace" => Trial(39.201 μs)
"countmap" => Trial(229.767 μs)
"counter" => Trial(196.279 μs)
"multiset1" => Trial(792.868 μs)
"sort" => Trial(65.060 μs)
"szufel" => Trial(42.194 μs)
"multiset2" => Trial(429.816 μs)
@gustaphe, very nice summary, but what a weird logarithmic scale axis that one is (with ticks at 10^4.8, etc.). Integer powers would be easier to read.
Yeah, for some reason that’s the default behaviour in GR. You can set the ticks manually, but I didn’t feel like it.
How’s this for effort?
I’d say you’ve studied this enough that you can propose that your best function gets added to Base =]
@gustaphe, it is apparent that you’ve found the right plunger shapes to unclog the non-unique problem. Thanks for the inspirational drawing.
a = [1,2,3,4,3,5,3,2]
[i for i in unique(a) if sum(i .== a) > 1 ]
This is not really how things work. New methods are added to Base only if they are sufficiently basic. I have to say that I don’t think I have ever needed something as specific as this.
It’s funny, that’s probably what I would have written. A fairly intuitive solution.
It’s by far the slowest of the suggested ones. It really stands out.
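Part of the cost is that sum(i .== a) allocates a temporary array and rescans all of a for every unique value. A small variation, sketched below under the hypothetical name nonunique_count, avoids the temporaries by using count with a predicate. It still rescans a once per unique value, so I would not expect it to catch up with the Dict- or sort-based versions.

```julia
# Sketch only: `nonunique_count` is a hypothetical name.
# count(==(i), a) counts matches without building the intermediate `i .== a` array,
# but `a` is still scanned once for every unique value.
nonunique_count(a) = [i for i in unique(a) if count(==(i), a) > 1]

nonunique_count([1,2,3,4,3,5,3,2])   # [2, 3], in order of first appearance
```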
When working with datasets, I would try DataFrames utilities for this kind of vector problem. It seems to be another good approach in this case (for current versions).
using DataFrames, BenchmarkTools
nonunique(x) =
    filter(:nrow => >(1), combine(groupby(DataFrame(x = x), :x), nrow)).x
julia> @btime nonunique([1, 2, 3, 4, 3, 5, 3, 2])
12.550 μs (202 allocations: 17.28 KiB)
2-element Array{Int64,1}:
2
3
P.S. The result is unsorted.