Hello,
I have two arrays with the following elements:

```
julia> h = unique(he.ID)
1421-element Array{String,1}:
"AC_000011"
"AC_000019"
"AC_000189"
"AC_000190"
⋮
"NC_039209"
"NC_039212"
"NC_039213"
"NC_039236"
julia> j = unique(tu.ID)
2891-element Array{String,1}:
"AC_000006"
"AC_000008"
"AC_000010"
"AC_000011"
⋮
"NC_039221"
"NC_039228"
"NC_039231"
"NC_039237"
```

How can I find the elements of `h` that are not in `j`? In other words, what is the equivalent of R’s `h[!(h %in% j)]`?
Thank you

lmiq
March 2, 2021, 12:07pm
#2
This is one way to do it:

```
julia> x = unique(rand(1:10,10));
julia> y = unique(rand(1:10,10));
julia> z = x[(!in).(x,Ref(y))]
4-element Array{Int64,1}:
7
1
6
8
```

Of course `(!in)` is “not in”, the `.` broadcasts that over the elements of `x`, and the `Ref(y)` guarantees that `y` will not be broadcast over.
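To make the role of `Ref` concrete, here is a minimal sketch with small hand-picked vectors (the values are made up for illustration, not taken from the question). It shows the mask that the broadcast produces, an equivalent one-element tuple wrapper, and what happens when `y` is left unprotected:

```julia
x = [7, 1, 6, 8, 3]
y = [3, 5, 9]

# Ref(y) makes broadcasting treat y as a single value, so every
# element of x is tested against the whole of y.
mask = (!in).(x, Ref(y))      # equivalent to x .∉ Ref(y)
z = x[mask]

# A one-element tuple protects y from broadcasting in the same way.
z_tuple = x[(!in).(x, (y,))]

# Without the wrapper, broadcasting tries to pair x[i] with y[i];
# with lengths 5 and 3 the shapes do not match and an error is thrown.
threw = try
    (!in).(x, y)
    false
catch err
    err isa DimensionMismatch
end
```

Here `z` keeps the elements of `x` in their original order, dropping only the `3` that also appears in `y`.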

This is the same, but prettier:

```
julia> z = x[x .∉ Ref(y)]
4-element Array{Int64,1}:
7
1
6
8
```

`∉` is typed as `\notin` + Tab.


DNF
March 2, 2021, 12:27pm
#3
There is actually a function dedicated to this purpose, `setdiff`:

```
help?> setdiff
setdiff(s, itrs...)
Construct the set of elements in s but not in any of the iterables in itrs. Maintain order with arrays.
```

```
jl> setdiff(x, y)
3-element Vector{Int64}:
4
8
7
```

This is using different randomly generated inputs, so the answer is different.

Interestingly, for this case at least, `setdiff` is significantly slower than @lmiq’s suggestion.
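As a small illustration on string IDs like the ones in the question, here is a sketch using two short made-up vectors (stand-ins, not the real `he.ID` / `tu.ID` data). It also shows one difference from the masking approach that only matters when the first input has duplicates:

```julia
# Hypothetical stand-ins for unique(he.ID) and unique(tu.ID).
h = ["AC_000011", "AC_000019", "AC_000189", "NC_039209"]
j = ["AC_000006", "AC_000011", "NC_039209"]

# Elements of h that are not in j, in the order they appear in h.
only_in_h = setdiff(h, j)

# setdiff returns each remaining element once, whereas the mask keeps
# repeats. With inputs that are already unique the two results agree.
d = [1, 1, 2]
via_setdiff = setdiff(d, [2])
via_mask = d[d .∉ Ref([2])]
```

Since the question already applies `unique` to both columns, either approach should give the same result there.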


lmiq
March 2, 2021, 12:32pm
#4

DNF: “slower”

`setdiff` gets much faster for larger arrays:

```
julia> using BenchmarkTools
julia> x = unique(rand(1:1000,1000)); y = unique(rand(1:1000,1000));
julia> f(x,y) = x[x .∉ Ref(y)]
f (generic function with 1 method)
julia> @btime f($x,$y);
98.185 μs (4 allocations: 6.36 KiB)
julia> @btime setdiff($x,$y);
19.669 μs (14 allocations: 51.04 KiB)
```


Eben60
March 5, 2021, 10:33pm
#5

lmiq: `x .∉ Ref(y)`

This is `O(n^2)`, as finding an element in an unordered array is `O(n)`. `setdiff`, however, scales like something between `O(n)` and `O(n log n)` in my test.

I have no idea how it works, but I would assume it sorts at least one of the arrays, as searching in a sorted array is fast. Afterwards it would put the returned data back into the initial order. For short arrays that may involve a (relatively) significant overhead.
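For what it is worth, near-linear scaling does not strictly require sorting: a hash-based `Set` gives average constant-time membership tests, and building it has a fixed cost that could explain the overhead on short arrays. The sketch below is only an illustration of that idea; `notin_set` is a hypothetical helper name, and this is not a claim about how `setdiff` is implemented internally:

```julia
# Hypothetical helper (not part of Base): keep the elements of x that
# are not in y, preserving the order (and any repeats) of x.
function notin_set(x, y)
    lookup = Set(y)                       # built once from y
    return [v for v in x if v ∉ lookup]   # average O(1) per membership test
end

x = [7, 1, 6, 8, 3, 7]
y = [3, 5, 9]
result = notin_set(x, y)

# The broadcast form can use the same trick by wrapping the Set in Ref.
result_broadcast = x[x .∉ Ref(Set(y))]
```

With this, the total work is roughly proportional to `length(x) + length(y)` instead of their product.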
