Find array elements not present in another array

Luigi_Marongiu · March 2, 2021, 11:33am

Hello,
I have two arrays with the following elements:

julia> h = unique(he.ID)
1421-element Array{String,1}:
 "AC_000011"
 "AC_000019"
 "AC_000189"
 "AC_000190"
 ⋮
 "NC_039209"
 "NC_039212"
 "NC_039213"
 "NC_039236"

julia> j = unique(tu.ID)
2891-element Array{String,1}:
 "AC_000006"
 "AC_000008"
 "AC_000010"
 "AC_000011"
 ⋮
 "NC_039221"
 "NC_039228"
 "NC_039231"
 "NC_039237"

How can I find the elements of h that are not in j? In other words, what is the equivalent to R’s h[!(h %in% j)]?
Thank you

lmiq · March 2, 2021, 12:07pm

This is one way to do it:

julia> x = unique(rand(1:10,10));

julia> y = unique(rand(1:10,10));

julia> z = x[(!in).(x,Ref(y))]
4-element Array{Int64,1}:
 7
 1
 6
 8

Here (!in) means “not in”, the . broadcasts it over the elements of x, and Ref(y) keeps y from being broadcast, so each element of x is tested against the whole of y.

This is the same, but prettier:

julia> z = x[x .∉ Ref(y)]
4-element Array{Int64,1}:
 7
 1
 6
 8

∉ is \notin + Tab.
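
As a small illustration of the broadcasting idiom on string IDs like those in the original question, here is a minimal sketch. The two short arrays are made-up stand-ins for h and j, and filter(!in(j), h) is an alternative spelling that does not appear elsewhere in this thread.

```julia
# Hypothetical small inputs standing in for h and j from the question
h = ["AC_000011", "AC_000019", "AC_000189", "NC_039209"]
j = ["AC_000006", "AC_000011", "NC_039209", "NC_039237"]

# Ref(j) makes broadcasting treat j as a single value,
# so every element of h is tested against all of j
mask = h .∉ Ref(j)        # Bool mask with the same length as h
only_in_h = h[mask]       # elements of h that do not occur in j

# Same result without building a mask
only_in_h_filter = filter(!in(j), h)
```

Both versions should leave "AC_000019" and "AC_000189", the closest match to R's h[!(h %in% j)].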

DNF · March 2, 2021, 12:27pm

There is actually a function dedicated to this purpose: setdiff:

help?> setdiff

  setdiff(s, itrs...)

  Construct the set of elements in s but not in any of the iterables in itrs. Maintain order with arrays.

jl> setdiff(x, y)
3-element Vector{Int64}:
 4
 8
 7

This is using different randomly generated inputs, so the answer is different.

Interestingly, for this case at least, it’s significantly slower than @lmiq’s suggestion.
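
One difference worth keeping in mind: the docstring says setdiff constructs a set, so repeated elements in the first argument collapse, while the mask-based version keeps them. A minimal sketch with made-up IDs (the arrays in the question are already unique, so there the two agree):

```julia
# Hypothetical inputs; note the repeated "AC_000019" in h
h = ["AC_000011", "AC_000019", "AC_000189", "AC_000019"]
j = ["AC_000006", "AC_000011", "NC_039237"]

d = setdiff(h, j)     # keeps order of first appearance, drops the duplicate
m = h[h .∉ Ref(j)]    # keeps every occurrence that is not in j
```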

lmiq · March 2, 2021, 12:32pm

setdiff becomes much faster than the broadcast version for larger arrays:

julia> x = unique(rand(1:1000,1000)); y = unique(rand(1:1000,1000));

julia> f(x,y) = x[x .∉ Ref(y)]
f (generic function with 1 method)

julia> @btime f($x,$y);
  98.185 μs (4 allocations: 6.36 KiB)

julia> @btime setdiff($x,$y);
  19.669 μs (14 allocations: 51.04 KiB)

Eben60 · March 5, 2021, 10:33pm

The broadcast approach is O(n^2), since finding an element in an unordered array is O(n). setdiff, however, scales somewhere between O(n) and O(n log n) in my test.

I have no idea how it works, but I would assume it sorts at least one of the arrays, since searching a sorted array is fast. Afterwards it would put the returned data back into the original order. For short arrays that may involve a (relatively) significant overhead.
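
Sorting is not the only way to avoid the O(n) scan per element. A hash-based Set gives average O(1) membership tests, so the whole filter is roughly linear. The sketch below only illustrates that idea and is not a claim about how setdiff is implemented. The variable names are made up for the example.

```julia
x = unique(rand(1:1000, 1000))
y = unique(rand(1:1000, 1000))

yset = Set(y)                  # one pass over y to build the hash set
fast = filter(!in(yset), x)    # average O(1) lookup per element of x

slow = x[x .∉ Ref(y)]          # linear scan of y for every element of x
```

Both give the same elements in the same order; the one-off cost of building the Set is the kind of overhead that can dominate for short arrays.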
