 # Find array elements not present in another array

Hello,
I have two arrays with the following elements:

``````
julia> h = unique(he.ID)
1421-element Array{String,1}:
"AC_000011"
"AC_000019"
"AC_000189"
"AC_000190"
⋮
"NC_039209"
"NC_039212"
"NC_039213"
"NC_039236"

julia> j = unique(tu.ID)
2891-element Array{String,1}:
"AC_000006"
"AC_000008"
"AC_000010"
"AC_000011"
⋮
"NC_039221"
"NC_039228"
"NC_039231"
"NC_039237"
``````

How can I find the elements of `h` that are not in `j`? In other words, what is the equivalent of R’s `h[!(h %in% j)]`?
Thank you

This is one way to do it:

``````
julia> x = unique(rand(1:10,10));

julia> y = unique(rand(1:10,10));

julia> z = x[(!in).(x,Ref(y))]
4-element Array{Int64,1}:
7
1
6
8

``````

Here `(!in)` is “not in”, the `.` broadcasts it over the elements of `x`, and `Ref(y)` ensures that `y` is treated as a single value instead of being broadcast over.

This is the same, but prettier:

``````
julia> z = x[x .∉ Ref(y)]
4-element Array{Int64,1}:
7
1
6
8

``````

∉ is `\notin` + Tab.
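
For a reproducible version with fixed inputs (the vectors below are made up for illustration and are not from the question), the following sketch shows that both spellings select the same elements. It also shows `filter(!in(y), x)`, a further alternative that needs no `Ref`, because `in(y)` is already a one-argument predicate.

```julia
x = [7, 1, 3, 6, 8, 2]   # illustrative data, not from the question
y = [3, 2, 9]

a = x[(!in).(x, Ref(y))]   # broadcast "not in" over x; Ref(y) keeps y whole
b = x[x .∉ Ref(y)]         # the same test written with the infix operator
c = filter(!in(y), x)      # in(y) is a predicate e -> e in y; ! negates it

println(a)   # [7, 1, 6, 8]
```

All three keep the order of `x` and keep any duplicates it contains.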

There is actually a function dedicated to this purpose, `setdiff`:

``````
help?> setdiff

setdiff(s, itrs...)

Construct the set of elements in s but not in any of the iterables in itrs. Maintain order with arrays.
``````
``````
julia> setdiff(x, y)
3-element Vector{Int64}:
4
8
7
``````

This is using different randomly generated inputs, so the answer is different.

Interestingly, for this case at least, it’s significantly slower than @leandromartinez98’s suggestion.
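
One behavioural difference to keep in mind, sketched here with made-up inputs: `setdiff` returns each element only once, whereas the indexing approach keeps duplicates present in `x`. For inputs that have already been passed through `unique`, as in the question, the two give the same elements.

```julia
x = [4, 8, 4, 7, 1]   # made-up data containing a duplicate
y = [1, 9]

d = setdiff(x, y)     # [4, 8, 7]     duplicates removed, order kept
m = x[x .∉ Ref(y)]    # [4, 8, 4, 7]  duplicates kept
```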

`setdiff` gets much faster for larger arrays:

``````
julia> x = unique(rand(1:1000,1000)); y = unique(rand(1:1000,1000));

julia> f(x,y) = x[x .∉ Ref(y)]
f (generic function with 1 method)

julia> @btime f($x,$y);
98.185 μs (4 allocations: 6.36 KiB)

julia> @btime setdiff($x,$y);
19.669 μs (14 allocations: 51.04 KiB)

``````

The broadcast version, `x[x .∉ Ref(y)]`, is `O(n^2)`, since finding an element in an unordered array is `O(n)` and that search is repeated for every element of `x`. `setdiff`, however, scales like something between `O(n)` and `O(n log n)` in my test.

I have no idea how it works, but I would assume it sorts at least one of the arrays, since searching in a sorted array is fast. Afterwards it would put the returned data back into the original order. For short arrays that may involve a (relatively) significant overhead.
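
Sorting is one way to get fast lookups; a hash-based `Set` is another, with membership tests that take roughly constant time on average. Below is a minimal sketch of that idea. The name `notin_set` is hypothetical, the inputs are shortened from the question for illustration, and this is not claimed to be how `setdiff` is implemented in Base.

```julia
# Hypothetical helper, not part of Base: elements of x that are not in y,
# keeping the order (and any duplicates) of x.
function notin_set(x, y)
    lookup = Set(y)                # built once; `in` on a Set is O(1) on average
    return filter(!in(lookup), x)  # single pass over x
end

h = ["AC_000011", "AC_000019", "AC_000189"]   # shortened, illustrative inputs
j = ["AC_000006", "AC_000011"]

only_in_h = notin_set(h, j)   # ["AC_000019", "AC_000189"]
```

Building the `Set` costs some time and memory up front, which is consistent with such an approach being slower on very short arrays and faster on long ones.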
