# Indices of intersection of two arrays

Suppose I have two arrays:

``````a = [2, 4, 6, 7, 10, 11]
b = [6, 7, 10, 11, 13, 15, 17, 19, 21, 23]
``````

How do I get the indices of values in `b` that are in `a` and vice versa? I know `intersection()` returns the values, but I’m after the indices of the two arrays. Right now I’m doing

``````indices = [findall(x->x==i,a) for i in b]
10-element Array{Array{Int64,1},1}:
[3]
[4]
[5]
[6]
[]
[]
[]
[]
[]
[]
``````

which doesn’t seem optimal. Any ideas? Thanks in advance!

You could write a function that loops through the two sorted vectors pretty easily.

1 Like
``````findall(x->x in b, a)
``````

or

``````findall(in(b),a)
``````
13 Likes

Perfect, thanks!

I am sharing a routine that efficiently loops through the two sorted vectors, in case someone stumbles upon this thread in the future.

The complexity is `O(n+m)`, where `n = length(x)` and `m = length(y)`.

``````function sorted_intersect(x::AbstractVector, y::AbstractVector)

@assert issorted( x )
@assert issorted( y )

idx_common_x = Vector{Int}()
idx_common_y = Vector{Int}()
i,j = firstindex(x), firstindex(y)
@inbounds while i <= lastindex(x) && j <= lastindex(y)
if x[i] == y[j]
push!( idx_common_x, i )
push!( idx_common_y, j )
i += 1
j += 1
elseif x[i] > y[j]
j += 1
else
i += 1
end
end

idx_common_x, idx_common_y

end
``````
4 Likes

It would be interesting to benchmark against the other suggestions, for reference.

4 Likes

You may get another 2X speedup by avoiding `push!` inside the loop and using `BitArray`s to store common indices flag instead. Although `push!` is quite efficient in Julia, using it inside hot loops can be expensive. The speedup from this change can vary depending on the ratio of common indices.

``````function sorted_intersect2(x::AbstractVector, y::AbstractVector)
@assert issorted( x )
@assert issorted( y )

idx_common_x = falses(length(x))
idx_common_y = falses(length(y))
i, j = firstindex(x), firstindex(y)
while i <= lastindex(x) && j <= lastindex(y)
if x[i] < y[j]
i += 1
elseif y[j] < x[i]
j += 1
else
idx_common_x[i] = true
idx_common_y[j] = true
i += 1
j += 1
end
end
(1:length(x))[idx_common_x], (1:length(y))[idx_common_y]
end

function sorted_intersect(x::AbstractVector, y::AbstractVector)
@assert issorted( x )
@assert issorted( y )

idx_common_x = Vector{Int}()
idx_common_y = Vector{Int}()
i,j = firstindex(x), firstindex(y)
while i <= lastindex(x) && j <= lastindex(y)
if x[i] == y[j]
push!( idx_common_x, i )
push!( idx_common_y, j )
i += 1
j += 1
elseif x[i] > y[j]
j += 1
else
i += 1
end
end
idx_common_x, idx_common_y
end

a = [1:1000;]    #[2, 4, 6, 7, 10, 11]
b = [600:1850;]  #[6, 7, 10, 11, 13, 15, 17, 19, 21, 23]
@assert sorted_intersect(a, b) == sorted_intersect2(a, b)
@btime (findall(in(\$b),\$a) , findall(in(\$a),\$b))
8.767 μs (10 allocations: 15.12 KiB)
@btime sorted_intersect(\$a, \$b)
6.700 μs (10 allocations: 15.12 KiB)
@btime sorted_intersect2(\$a, \$b)
3.263 μs (6 allocations: 7.12 KiB)
``````
5 Likes

Thank you for benchmarking the different solutions.
I am re-running your experiments for larger arrays (1 million elements)

``````julia> a = sort( rand(1:1_000_000_000, 1_000_000) );
julia> b = sort( rand(1:1_000_000_000, 1_000_000) );
julia> @btime (findall(in(\$b),\$a) , findall(in(\$a),\$b))
21.785 ms (20 allocations: 32.78 KiB)
julia> @btime sorted_intersect(\$a, \$b)
12.730 ms (20 allocations: 32.78 KiB)
julia> @btime sorted_intersect2(\$a, \$b)
11.034 ms (8 allocations: 260.47 KiB)
``````

Version 2 is indeed slightly faster, with a small overhead in memory.

• Version 1 will allocate memory equal to the size of the intersection, while
• Version 2 will allocate `m + n` bytes of memory.

Depending on the application, the user may decide which is one is more appropriate.

1 Like

Indeed, as the vector size increases, memory allocations happen less and less often and the benefit from using a BitArray becomes less evident after some threshold. Many algorithms follow this pattern, but the problem is that it is difficult to predict such a threshold, especially on different hardware. I remember applying these tricks to some algorithms like in Primes.jl, and around 10^9 vector lengths, push! becomes more efficient. I imagine push! will become more smarter in the future or hardware will be able to handle this and we won’t need such tricks.

3 Likes

Here is a version that works for non sorted arrays

a =[3 , 45 , 123 , 12]
b = [12 , 19 , 46 , 56 , 123]

function intersectalamatlab( a , b )
function findindices!( resa , ab , a)
for ( i , el) ∈ enumerate(ab)
resa[i] = findfirst( x->x==el , a )
end
end
ab = intersect(a,b)
resa=Vector{Int64}(undef,length(ab))
findindices!( resa , ab , a)
resa
resb=similar(resa)
findindices!( resb , ab , b)
resb

``````return (ab , resa , resb )
``````

end

intersectalamatlab( a , b )

1 Like

@FredericN, fyi the accepted solution works for non-sorted arrays:

``````a = [3 , 45 , 123 , 12]
b = [12 , 19 , 46 , 56 , 123]

findall(in(b),a)    # indices of elements of 'a' that occur in 'b'
3
4
``````

If we put your code “à la MATLAB” inside triple backticks:

``````function intersectalamatlab(a, b)
function findindices!(resa, ab, a)
for (i, el) ∈ enumerate(ab)
resa[i] = findfirst(x->x==el, a)
end
end
ab = intersect(a, b)
resa = Vector{Int64}(undef, length(ab))
findindices!(resa, ab, a)
resb = similar(resa)
findindices!(resb, ab, b)
return (ab, resa, resb)
end
``````

and run it, it outputs both the elements and the indices in both arrays:

``````intersectalamatlab(a, b)
([123, 12], [3, 4], [5, 1])
``````

For the examples further above with a and b of length 1 million each, the accepted solution seems to run ~12x faster for unsorted inputs, and ~80x faster otherwise.

I don’t know if it’s a relevant case, but how to handle repeated values in vectors?

``````a = [3 , 12, 45 , 123 , 12]
b = [12 , 19 ,123, 46 , 56 , 123]
``````

given both arrays are sorted, we could binary search the lower and upper bounds in Array A, while iterating through values in array B, giving us `O(bloga)` complexity.

One more way is to de-duplicate array `a` and `b`. We maintain a dictionary preserving a mapping of `element -> indices it occurs in array`, and then apply the algorithm from @Seif_Shebl 's answer .

Possibly running it twice:

``````ia = findall(in(b), a)
ib = findall(in(view(a,ia)), b)
# list all common elements: a[ia]; b[ib]
``````

The original matlab function ‘intersect’ returns not only the intersection but the positions in the original vectors, this is why it returns a triplet. I need such a function in Julia. If you have something faster that works for unsorted array, I would be glad to have it. Does it answer your question?

The following code aspires to participate at the next Obfuscated Julia Code Contest

``````
a = [3 , 12, 45 , 123 , 12,45,12,7,7,7,3,12,7]
b = [12 , 19 ,123, 46 , 56 , 123]

function cod(l,x,y)
exp=ceil(log(10,l))+1
x*10^exp+y
end

function decod(l,n)
grp=10^(ceil(log(10,l))+1)
pos=Int[]
while n>=1
(n,r)=divrem(n,grp)
push!(pos,r)
end
pos
end

da=merge((x,y)->cod(length(a),x,y),Dict.(Base.splat(Pair).(reverse.(enumerate(a))))...)
db=merge((x,y)->cod(length(b),x,y),Dict.(Base.splat(Pair).(reverse.(enumerate(b))))...)

d=intersect(keys(da),keys(db))
collect(Iterators.flatten(decod.(length(a),get.([da], d,missing))))
collect(Iterators.flatten(decod.(length(b),get.([db], d,missing))))

``````
`````` function decod1(l,n,pos=Int[])
grp=10^(ceil(log(10,l))+1)
(n,r)=divrem(n,grp)
push!(pos,r)
n >= 1 ? decod1(l,n,pos) : pos
end
``````

@FredericN, as requested please find below a function `intersectalajulia()` based in the OP’s accepted solution, which handles repetead elements and runs faster for the million-element vectors benchmarked further below:

``````function intersectalajulia(a,b)
ia = findall(in(b), a)
return unique(view(a,ia)), ia, findall(in(view(a,ia)), b)
end

a = [3 , 12, 45 , 123 , 12]
b = [12 , 19 ,123, 46 , 56 , 123]
intersectalamatlab(a, b)    # ([12, 123], [2, 4], [1, 3])
intersectalajulia(a,b)      # ([12, 123], [2, 4, 5], [1, 3, 6])

using BenchmarkTools
a = rand(1:1_000_000_000, 1_000_000);
b = rand(1:1_000_000_000, 1_000_000);
@benchmark intersectalamatlab(a, b)     # min = 846 ms, 24 alloc, 43.6 MiB
@benchmark intersectalajulia(a, b)      # min = 77 ms, 50 alloc, 18.1 MiB
``````
``````Thanks for your solution. But my problem is that with the following input:
a =[3 , 45 , 123 , 12]
b = [12 , 19 , 46 , 56 , 123]

the outputs are not the same:
julia> intersectalajulia(a,b)
([123, 12], [3, 4], [1, 5])

julia> intersectalamatlab( a , b )
([123, 12], [3, 4], [5, 1])

I want to keep track that a[3] = b[5] and a[4] = b[1]. Your solution is faster but does not provide the output I am looking for.``````

OK, the intent was not clear in your OP.

I think Julia’s function `indexin()` can be added to achieve what you want and the original speed is barely affected:

``````function intersectalajulia2(a,b)
ia = findall(in(b), a)
ib = findall(in(view(a,ia)), b)
return unique(view(a,ia)), ia, ib[indexin(view(a,ia), view(b,ib))]
end
``````

Run several random tests (for non-repeated elements) and the equality was always true:

``````using BenchmarkTools, StatsBase
a = sample(1:rand(30:100), rand(1:20), replace=false);
b = sample(1:rand(30:100), rand(1:20), replace=false);
intersectalamatlab(a, b) == intersectalajulia2(a, b)  # true
``````

Dear Rafael,

many thanks for the answer but at least with Julia 1.6 I have problems with the following test:

a = rand(1:1_000, 1_000);
b = rand(1:1_000, 1_000);
intersectalajulia2(a,b)

I get one over five times (say) an error message of the type:
ERROR: BoundsError: attempt to access 615-element Vector{Int64} at index [Union{Nothing, Int64}[566, 129, 459, 110, 269, 372, 261, 265, 76, 137 … 42, 327, 201, 314, 16, 559, 97, 201, 609, 36]]
Stacktrace:
[1] throw_boundserror(A::Vector{Int64}, I::Tuple{Vector{Union{Nothing, Int64}}})
@ Base ./abstractarray.jl:651
[2] checkbounds
@ ./abstractarray.jl:616 [inlined]
[3] _getindex
@ ./multidimensional.jl:831 [inlined]
[4] getindex
@ ./abstractarray.jl:1170 [inlined]
[5] intersectalajulia2(a::Vector{Int64}, b::Vector{Int64})
@ Main ./none:4
[6] top-level scope
@ none:1