Hello,
I have two arrays with the following elements:
julia> h = unique(he.ID)
1421-element Array{String,1}:
"AC_000011"
"AC_000019"
"AC_000189"
"AC_000190"
⋮
"NC_039209"
"NC_039212"
"NC_039213"
"NC_039236"
julia> j = unique(tu.ID)
2891-element Array{String,1}:
"AC_000006"
"AC_000008"
"AC_000010"
"AC_000011"
⋮
"NC_039221"
"NC_039228"
"NC_039231"
"NC_039237"
How can I find the elements of h that are not in j? In other words, what is the equivalent of R’s h[!(h %in% j)]?
Thank you
lmiq
March 2, 2021, 12:07pm
2
This is one way to do it:
julia> x = unique(rand(1:10,10));
julia> y = unique(rand(1:10,10));
julia> z = x[(!in).(x,Ref(y))]
4-element Array{Int64,1}:
7
1
6
8
Here (!in) is “not in”, the . broadcasts it over the elements of x, and Ref(y) guarantees that y is treated as a single value instead of being broadcast over.
This is the same, but prettier:
julia> z = x[x .∉ Ref(y)]
4-element Array{Int64,1}:
7
1
6
8
∉ is typed as \notin + Tab.
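Applied to string IDs like the ones in the question, the same pattern might look like the sketch below. The short vectors are a made-up subset of the IDs shown above, chosen only so that the result is easy to check by eye.

```julia
# Small stand-ins for unique(he.ID) and unique(tu.ID) (illustrative data only).
h = ["AC_000011", "AC_000019", "NC_039209"]
j = ["AC_000006", "AC_000011", "NC_039237"]

# Ref(j) makes broadcasting treat j as a single value, so each element
# of h is tested against the whole of j.
only_in_h = h[h .∉ Ref(j)]

println(only_in_h)  # the elements of h that do not occur in j
```

This mirrors R’s h[!(h %in% j)]: a Boolean mask is built first and then used to index h.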
DNF
March 2, 2021, 12:27pm
3
There is actually a function dedicated to this purpose, setdiff:
help?> setdiff
setdiff(s, itrs...)
Construct the set of elements in s but not in any of the iterables in itrs. Maintain order with arrays.
jl> setdiff(x, y)
3-element Vector{Int64}:
4
8
7
This is using different randomly generated inputs, so the answer is different.
Interestingly, for this case at least, it’s significantly slower than @lmiq’s suggestion.
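One difference worth keeping in mind: because setdiff constructs a set, it returns each remaining element only once, while the indexing approach keeps every occurrence. For the arrays in the question this does not matter, since they come from unique. A small sketch with made-up data:

```julia
# Illustrative input with a repeated element.
x = ["a", "a", "b", "c"]
y = ["b"]

by_index   = x[x .∉ Ref(y)]  # mask-based: keeps both copies of "a"
by_setdiff = setdiff(x, y)   # set-based: keeps "a" once, in order of first appearance

println(by_index)
println(by_setdiff)
```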
lmiq
March 2, 2021, 12:32pm
4
DNF:
slower
setdiff becomes much faster for larger arrays:
julia> using BenchmarkTools
julia> x = unique(rand(1:1000,1000)); y = unique(rand(1:1000,1000));
julia> f(x,y) = x[x .∉ Ref(y)]
f (generic function with 1 method)
julia> @btime f($x,$y);
98.185 μs (4 allocations: 6.36 KiB)
julia> @btime setdiff($x,$y);
19.669 μs (14 allocations: 51.04 KiB)
Eben60
March 5, 2021, 10:33pm
5
lmiq:
x .∉ Ref(y)
This is O(n^2), as finding an element in an unordered array is O(n). setdiff, however, scales like something between O(n) and O(n log n) in my test.
I have no idea how it works, but I would assume it sorts at least one of the arrays, as searching in a sorted array is fast. Afterwards it would put the returned data back into the initial order. For short arrays that may involve a (relatively) significant overhead.
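For what it’s worth, sorting is not the only way to get fast lookups: a hash-based Set also gives roughly constant-time membership tests on average. Below is a minimal sketch of that idea. The helper name notin_set is made up for illustration, the data is a small subset of the IDs from the question, and none of this is a claim about how setdiff is implemented internally.

```julia
# Hypothetical helper: elements of x that are not in y, keeping the order of x.
function notin_set(x, y)
    lookup = Set(y)                        # build the hash set once
    return filter(e -> !(e in lookup), x)  # one set lookup per element of x
end

h = ["AC_000011", "AC_000019", "NC_039209"]
j = ["AC_000006", "AC_000011", "NC_039237"]
result = notin_set(h, j)

println(result)
```

Building the Set costs some time and memory up front, which would be consistent with the overhead seen for short arrays, but pays off once both inputs are large.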