Indices of intersection of two arrays

Suppose I have two arrays:

a = [2, 4, 6, 7, 10, 11]
b = [6, 7, 10, 11, 13, 15, 17, 19, 21, 23]

How do I get the indices of values in b that are in a and vice versa? I know intersection() returns the values, but I’m after the indices of the two arrays. Right now I’m doing

indices = [findall(x->x==i,a) for i in b]
10-element Array{Array{Int64,1},1}:
 [3]
 [4]
 [5]
 [6]
 [] 
 [] 
 [] 
 [] 
 [] 
 []

which doesn’t seem optimal. Any ideas? Thanks in advance!

You could write a function that loops through the two sorted vectors pretty easily.

1 Like
findall(x->x in b, a)

or

findall(in(b),a)
14 Likes

Perfect, thanks!

I am sharing a routine that efficiently loops through the two sorted vectors, in case someone stumbles upon this thread in the future.

The complexity is O(n+m), where n = length(x) and m = length(y).

function sorted_intersect(x::AbstractVector, y::AbstractVector)

  @assert issorted( x )
  @assert issorted( y )

  idx_common_x = Vector{Int}()
  idx_common_y = Vector{Int}()
  i,j = firstindex(x), firstindex(y)
  @inbounds while i <= lastindex(x) && j <= lastindex(y)
    if x[i] == y[j]
      push!( idx_common_x, i )
      push!( idx_common_y, j )
      i += 1
      j += 1
    elseif x[i] > y[j]
      j += 1
    else
      i += 1
    end
  end

  idx_common_x, idx_common_y

end
4 Likes

:grinning: It would be interesting to benchmark against the other suggestions, for reference.

4 Likes

You may get another 2X speedup by avoiding push! inside the loop and using BitArrays to store common indices flag instead. Although push! is quite efficient in Julia, using it inside hot loops can be expensive. The speedup from this change can vary depending on the ratio of common indices.

function sorted_intersect2(x::AbstractVector, y::AbstractVector)
    @assert issorted( x ) 
    @assert issorted( y ) 

    idx_common_x = falses(length(x))
    idx_common_y = falses(length(y))
    i, j = firstindex(x), firstindex(y)
    while i <= lastindex(x) && j <= lastindex(y)
        if x[i] < y[j]
            i += 1
        elseif y[j] < x[i]
            j += 1
        else
            idx_common_x[i] = true
            idx_common_y[j] = true
            i += 1
            j += 1
        end
    end
    (1:length(x))[idx_common_x], (1:length(y))[idx_common_y]
end

function sorted_intersect(x::AbstractVector, y::AbstractVector)
    @assert issorted( x )
    @assert issorted( y )

    idx_common_x = Vector{Int}()
    idx_common_y = Vector{Int}()
    i,j = firstindex(x), firstindex(y)
    while i <= lastindex(x) && j <= lastindex(y)
        if x[i] == y[j]
            push!( idx_common_x, i )
            push!( idx_common_y, j )
            i += 1
            j += 1
        elseif x[i] > y[j]
            j += 1
        else
            i += 1
        end
    end
    idx_common_x, idx_common_y
end

a = [1:1000;]    #[2, 4, 6, 7, 10, 11]
b = [600:1850;]  #[6, 7, 10, 11, 13, 15, 17, 19, 21, 23]
@assert sorted_intersect(a, b) == sorted_intersect2(a, b)
@btime (findall(in($b),$a) , findall(in($a),$b))
  8.767 μs (10 allocations: 15.12 KiB)
@btime sorted_intersect($a, $b)
  6.700 μs (10 allocations: 15.12 KiB)
@btime sorted_intersect2($a, $b)
  3.263 μs (6 allocations: 7.12 KiB)
5 Likes

Thank you for benchmarking the different solutions.
I am re-running your experiments for larger arrays (1 million elements)

julia> a = sort( rand(1:1_000_000_000, 1_000_000) );
julia> b = sort( rand(1:1_000_000_000, 1_000_000) );
julia> @btime (findall(in($b),$a) , findall(in($a),$b))
  21.785 ms (20 allocations: 32.78 KiB)
julia> @btime sorted_intersect($a, $b)
  12.730 ms (20 allocations: 32.78 KiB)
julia> @btime sorted_intersect2($a, $b)
  11.034 ms (8 allocations: 260.47 KiB)

Version 2 is indeed slightly faster, with a small overhead in memory.

  • Version 1 will allocate memory equal to the size of the intersection, while
  • Version 2 will allocate m + n bytes of memory.

Depending on the application, the user may decide which is one is more appropriate.

1 Like

Indeed, as the vector size increases, memory allocations happen less and less often and the benefit from using a BitArray becomes less evident after some threshold. Many algorithms follow this pattern, but the problem is that it is difficult to predict such a threshold, especially on different hardware. I remember applying these tricks to some algorithms like in Primes.jl, and around 10^9 vector lengths, push! becomes more efficient. I imagine push! will become more smarter in the future or hardware will be able to handle this and we won’t need such tricks.

3 Likes

Here is a version that works for non sorted arrays

a =[3 , 45 , 123 , 12]
b = [12 , 19 , 46 , 56 , 123]

function intersectalamatlab( a , b )
function findindices!( resa , ab , a)
for ( i , el) ∈ enumerate(ab)
resa[i] = findfirst( x->x==el , a )
end
end
ab = intersect(a,b)
resa=Vector{Int64}(undef,length(ab))
findindices!( resa , ab , a)
resa
resb=similar(resa)
findindices!( resb , ab , b)
resb

return (ab , resa , resb )

end

intersectalamatlab( a , b )

1 Like

@FredericN, fyi the accepted solution works for non-sorted arrays:

a = [3 , 45 , 123 , 12]
b = [12 , 19 , 46 , 56 , 123]

findall(in(b),a)    # indices of elements of 'a' that occur in 'b'
 3
 4

If we put your code “à la MATLAB” inside triple backticks:

function intersectalamatlab(a, b)
    function findindices!(resa, ab, a)
        for (i, el) ∈ enumerate(ab)
            resa[i] = findfirst(x->x==el, a)
        end
    end
    ab = intersect(a, b)
    resa = Vector{Int64}(undef, length(ab))
    findindices!(resa, ab, a)
    resb = similar(resa)
    findindices!(resb, ab, b)
    return (ab, resa, resb)
end

and run it, it outputs both the elements and the indices in both arrays:

intersectalamatlab(a, b)
([123, 12], [3, 4], [5, 1])

For the examples further above with a and b of length 1 million each, the accepted solution seems to run ~12x faster for unsorted inputs, and ~80x faster otherwise.

I don’t know if it’s a relevant case, but how to handle repeated values in vectors?

a = [3 , 12, 45 , 123 , 12]
b = [12 , 19 ,123, 46 , 56 , 123]

given both arrays are sorted, we could binary search the lower and upper bounds in Array A, while iterating through values in array B, giving us O(bloga) complexity.

One more way is to de-duplicate array a and b. We maintain a dictionary preserving a mapping of element -> indices it occurs in array, and then apply the algorithm from @Seif_Shebl 's answer :slight_smile:.

Possibly running it twice:

ia = findall(in(b), a)
ib = findall(in(view(a,ia)), b)
# list all common elements: a[ia]; b[ib]

The original matlab function ‘intersect’ returns not only the intersection but the positions in the original vectors, this is why it returns a triplet. I need such a function in Julia. If you have something faster that works for unsorted array, I would be glad to have it. Does it answer your question?

The following code aspires to participate at the next Obfuscated Julia Code Contest :grinning_face_with_smiling_eyes:


a = [3 , 12, 45 , 123 , 12,45,12,7,7,7,3,12,7]
b = [12 , 19 ,123, 46 , 56 , 123]



 function cod(l,x,y)
    exp=ceil(log(10,l))+1
    x*10^exp+y
 end

 function decod(l,n)
    grp=10^(ceil(log(10,l))+1)
    pos=Int[]
    while n>=1
    (n,r)=divrem(n,grp)
        push!(pos,r)
    end
    pos
 end

 
da=merge((x,y)->cod(length(a),x,y),Dict.(Base.splat(Pair).(reverse.(enumerate(a))))...)
db=merge((x,y)->cod(length(b),x,y),Dict.(Base.splat(Pair).(reverse.(enumerate(b))))...)

d=intersect(keys(da),keys(db))
collect(Iterators.flatten(decod.(length(a),get.([da], d,missing))))
collect(Iterators.flatten(decod.(length(b),get.([db], d,missing))))

 function decod1(l,n,pos=Int[])
    grp=10^(ceil(log(10,l))+1)
    (n,r)=divrem(n,grp)
    push!(pos,r)
    n >= 1 ? decod1(l,n,pos) : pos
 end
1 Like

@FredericN, as requested please find below a function intersectalajulia() based in the OP’s accepted solution, which handles repetead elements and runs faster for the million-element vectors benchmarked further below:

function intersectalajulia(a,b)
    ia = findall(in(b), a)
    return unique(view(a,ia)), ia, findall(in(view(a,ia)), b)   
end

a = [3 , 12, 45 , 123 , 12]
b = [12 , 19 ,123, 46 , 56 , 123]
intersectalamatlab(a, b)    # ([12, 123], [2, 4], [1, 3])
intersectalajulia(a,b)      # ([12, 123], [2, 4, 5], [1, 3, 6])

using BenchmarkTools
a = rand(1:1_000_000_000, 1_000_000);
b = rand(1:1_000_000_000, 1_000_000);
@benchmark intersectalamatlab(a, b)     # min = 846 ms, 24 alloc, 43.6 MiB
@benchmark intersectalajulia(a, b)      # min = 77 ms, 50 alloc, 18.1 MiB
1 Like
Thanks for your solution. But my problem is that with the following input:
a =[3 , 45 , 123 , 12]
b = [12 , 19 , 46 , 56 , 123]

the outputs are not the same:
julia> intersectalajulia(a,b)
([123, 12], [3, 4], [1, 5])

julia> intersectalamatlab( a , b )
([123, 12], [3, 4], [5, 1])

I want to keep track that a[3] = b[5] and a[4] = b[1]. Your solution is faster but does not provide the output I am looking for.

OK, the intent was not clear in your OP.

I think Julia’s function indexin() can be added to achieve what you want and the original speed is barely affected:

function intersectalajulia2(a,b)
    ia = findall(in(b), a)
    ib = findall(in(view(a,ia)), b)
    return unique(view(a,ia)), ia, ib[indexin(view(a,ia), view(b,ib))]
end

Run several random tests (for non-repeated elements) and the equality was always true:

using BenchmarkTools, StatsBase
a = sample(1:rand(30:100), rand(1:20), replace=false);
b = sample(1:rand(30:100), rand(1:20), replace=false);
intersectalamatlab(a, b) == intersectalajulia2(a, b)  # true
1 Like

Dear Rafael,

many thanks for the answer but at least with Julia 1.6 I have problems with the following test:

a = rand(1:1_000, 1_000);
b = rand(1:1_000, 1_000);
intersectalajulia2(a,b)

I get one over five times (say) an error message of the type:
ERROR: BoundsError: attempt to access 615-element Vector{Int64} at index [Union{Nothing, Int64}[566, 129, 459, 110, 269, 372, 261, 265, 76, 137 … 42, 327, 201, 314, 16, 559, 97, 201, 609, 36]]
Stacktrace:
[1] throw_boundserror(A::Vector{Int64}, I::Tuple{Vector{Union{Nothing, Int64}}})
@ Base ./abstractarray.jl:651
[2] checkbounds
@ ./abstractarray.jl:616 [inlined]
[3] _getindex
@ ./multidimensional.jl:831 [inlined]
[4] getindex
@ ./abstractarray.jl:1170 [inlined]
[5] intersectalajulia2(a::Vector{Int64}, b::Vector{Int64})
@ Main ./none:4
[6] top-level scope
@ none:1