Indices of intersection of two arrays

wonderboy3489 · April 11, 2019, 4:07pm

Suppose I have two arrays:

a = [2, 4, 6, 7, 10, 11]
b = [6, 7, 10, 11, 13, 15, 17, 19, 21, 23]

How do I get the indices of values in b that are in a and vice versa? I know intersection() returns the values, but I’m after the indices of the two arrays. Right now I’m doing

indices = [findall(x->x==i,a) for i in b]
10-element Array{Array{Int64,1},1}:
 [3]
 [4]
 [5]
 [6]
 [] 
 [] 
 [] 
 [] 
 [] 
 []

which doesn’t seem optimal. Any ideas? Thanks in advance!

StefanKarpinski · April 11, 2019, 4:15pm

You could write a function that loops through the two sorted vectors pretty easily.

yha · April 11, 2019, 4:25pm

findall(x->x in b, a)

or

findall(in(b),a)

wonderboy3489 · April 11, 2019, 4:26pm

Perfect, thanks!

fcdimitr · July 22, 2021, 5:19pm

I am sharing a routine that efficiently loops through the two sorted vectors, in case someone stumbles upon this thread in the future.

The complexity is O(n+m), where n = length(x) and m = length(y).

function sorted_intersect(x::AbstractVector, y::AbstractVector)

  @assert issorted( x )
  @assert issorted( y )

  idx_common_x = Vector{Int}()
  idx_common_y = Vector{Int}()
  i,j = firstindex(x), firstindex(y)
  @inbounds while i <= lastindex(x) && j <= lastindex(y)
    if x[i] == y[j]
      push!( idx_common_x, i )
      push!( idx_common_y, j )
      i += 1
      j += 1
    elseif x[i] > y[j]
      j += 1
    else
      i += 1
    end
  end

  idx_common_x, idx_common_y

end

ptoche · July 23, 2021, 12:47am

It would be interesting to benchmark against the other suggestions, for reference.

Seif_Shebl · July 23, 2021, 1:00pm

You may get another 2X speedup by avoiding push! inside the loop and using BitArrays to store common indices flag instead. Although push! is quite efficient in Julia, using it inside hot loops can be expensive. The speedup from this change can vary depending on the ratio of common indices.

function sorted_intersect2(x::AbstractVector, y::AbstractVector)
    @assert issorted( x ) 
    @assert issorted( y ) 

    idx_common_x = falses(length(x))
    idx_common_y = falses(length(y))
    i, j = firstindex(x), firstindex(y)
    while i <= lastindex(x) && j <= lastindex(y)
        if x[i] < y[j]
            i += 1
        elseif y[j] < x[i]
            j += 1
        else
            idx_common_x[i] = true
            idx_common_y[j] = true
            i += 1
            j += 1
        end
    end
    (1:length(x))[idx_common_x], (1:length(y))[idx_common_y]
end

function sorted_intersect(x::AbstractVector, y::AbstractVector)
    @assert issorted( x )
    @assert issorted( y )

    idx_common_x = Vector{Int}()
    idx_common_y = Vector{Int}()
    i,j = firstindex(x), firstindex(y)
    while i <= lastindex(x) && j <= lastindex(y)
        if x[i] == y[j]
            push!( idx_common_x, i )
            push!( idx_common_y, j )
            i += 1
            j += 1
        elseif x[i] > y[j]
            j += 1
        else
            i += 1
        end
    end
    idx_common_x, idx_common_y
end

a = [1:1000;]    #[2, 4, 6, 7, 10, 11]
b = [600:1850;]  #[6, 7, 10, 11, 13, 15, 17, 19, 21, 23]
@assert sorted_intersect(a, b) == sorted_intersect2(a, b)
@btime (findall(in($b),$a) , findall(in($a),$b))
  8.767 μs (10 allocations: 15.12 KiB)
@btime sorted_intersect($a, $b)
  6.700 μs (10 allocations: 15.12 KiB)
@btime sorted_intersect2($a, $b)
  3.263 μs (6 allocations: 7.12 KiB)

fcdimitr · July 24, 2021, 10:16am

Thank you for benchmarking the different solutions.
I am re-running your experiments for larger arrays (1 million elements)

julia> a = sort( rand(1:1_000_000_000, 1_000_000) );
julia> b = sort( rand(1:1_000_000_000, 1_000_000) );
julia> @btime (findall(in($b),$a) , findall(in($a),$b))
  21.785 ms (20 allocations: 32.78 KiB)
julia> @btime sorted_intersect($a, $b)
  12.730 ms (20 allocations: 32.78 KiB)
julia> @btime sorted_intersect2($a, $b)
  11.034 ms (8 allocations: 260.47 KiB)

Version 2 is indeed slightly faster, with a small overhead in memory.

Version 1 will allocate memory equal to the size of the intersection, while
Version 2 will allocate m + n bytes of memory.

Depending on the application, the user may decide which is one is more appropriate.

Seif_Shebl · July 24, 2021, 5:34pm

Indeed, as the vector size increases, memory allocations happen less and less often and the benefit from using a BitArray becomes less evident after some threshold. Many algorithms follow this pattern, but the problem is that it is difficult to predict such a threshold, especially on different hardware. I remember applying these tricks to some algorithms like in Primes.jl, and around 10^9 vector lengths, push! becomes more efficient. I imagine push! will become more smarter in the future or hardware will be able to handle this and we won’t need such tricks.

FredericN · December 27, 2021, 5:50pm

Here is a version that works for non sorted arrays

a =[3 , 45 , 123 , 12]
b = [12 , 19 , 46 , 56 , 123]

function intersectalamatlab( a , b )
function findindices!( resa , ab , a)
for ( i , el) ∈ enumerate(ab)
resa[i] = findfirst( x->x==el , a )
end
end
ab = intersect(a,b)
resa=Vector{Int64}(undef,length(ab))
findindices!( resa , ab , a)
resa
resb=similar(resa)
findindices!( resb , ab , b)
resb

return (ab , resa , resb )

end

intersectalamatlab( a , b )

rafael.guerra · December 27, 2021, 6:08pm

@FredericN, fyi the accepted solution works for non-sorted arrays:

a = [3 , 45 , 123 , 12]
b = [12 , 19 , 46 , 56 , 123]

findall(in(b),a)    # indices of elements of 'a' that occur in 'b'
 3
 4

If we put your code “à la MATLAB” inside triple backticks:

function intersectalamatlab(a, b)
    function findindices!(resa, ab, a)
        for (i, el) ∈ enumerate(ab)
            resa[i] = findfirst(x->x==el, a)
        end
    end
    ab = intersect(a, b)
    resa = Vector{Int64}(undef, length(ab))
    findindices!(resa, ab, a)
    resb = similar(resa)
    findindices!(resb, ab, b)
    return (ab, resa, resb)
end

and run it, it outputs both the elements and the indices in both arrays:

intersectalamatlab(a, b)
([123, 12], [3, 4], [5, 1])

For the examples further above with a and b of length 1 million each, the accepted solution seems to run ~12x faster for unsorted inputs, and ~80x faster otherwise.

rocco_sprmnt21 · December 27, 2021, 7:12pm

I don’t know if it’s a relevant case, but how to handle repeated values in vectors?

a = [3 , 12, 45 , 123 , 12]
b = [12 , 19 ,123, 46 , 56 , 123]

learned_fool · December 27, 2021, 7:32pm

given both arrays are sorted, we could binary search the lower and upper bounds in Array A, while iterating through values in array B, giving us O(bloga) complexity.

One more way is to de-duplicate array a and b. We maintain a dictionary preserving a mapping of element -> indices it occurs in array, and then apply the algorithm from @Seif_Shebl 's answer .

rafael.guerra · December 27, 2021, 8:55pm

Possibly running it twice:

ia = findall(in(b), a)
ib = findall(in(view(a,ia)), b)
# list all common elements: a[ia]; b[ib]

FredericN · December 27, 2021, 9:36pm

The original matlab function ‘intersect’ returns not only the intersection but the positions in the original vectors, this is why it returns a triplet. I need such a function in Julia. If you have something faster that works for unsorted array, I would be glad to have it. Does it answer your question?

rocco_sprmnt21 · December 27, 2021, 9:38pm

The following code aspires to participate at the next Obfuscated Julia Code Contest


a = [3 , 12, 45 , 123 , 12,45,12,7,7,7,3,12,7]
b = [12 , 19 ,123, 46 , 56 , 123]



 function cod(l,x,y)
    exp=ceil(log(10,l))+1
    x*10^exp+y
 end

 function decod(l,n)
    grp=10^(ceil(log(10,l))+1)
    pos=Int[]
    while n>=1
    (n,r)=divrem(n,grp)
        push!(pos,r)
    end
    pos
 end

 
da=merge((x,y)->cod(length(a),x,y),Dict.(Base.splat(Pair).(reverse.(enumerate(a))))...)
db=merge((x,y)->cod(length(b),x,y),Dict.(Base.splat(Pair).(reverse.(enumerate(b))))...)

d=intersect(keys(da),keys(db))
collect(Iterators.flatten(decod.(length(a),get.([da], d,missing))))
collect(Iterators.flatten(decod.(length(b),get.([db], d,missing))))

 function decod1(l,n,pos=Int[])
    grp=10^(ceil(log(10,l))+1)
    (n,r)=divrem(n,grp)
    push!(pos,r)
    n >= 1 ? decod1(l,n,pos) : pos
 end

rafael.guerra · December 27, 2021, 10:08pm

@FredericN, as requested please find below a function intersectalajulia() based in the OP’s accepted solution, which handles repetead elements and runs faster for the million-element vectors benchmarked further below:

function intersectalajulia(a,b)
    ia = findall(in(b), a)
    return unique(view(a,ia)), ia, findall(in(view(a,ia)), b)   
end

a = [3 , 12, 45 , 123 , 12]
b = [12 , 19 ,123, 46 , 56 , 123]
intersectalamatlab(a, b)    # ([12, 123], [2, 4], [1, 3])
intersectalajulia(a,b)      # ([12, 123], [2, 4, 5], [1, 3, 6])

using BenchmarkTools
a = rand(1:1_000_000_000, 1_000_000);
b = rand(1:1_000_000_000, 1_000_000);
@benchmark intersectalamatlab(a, b)     # min = 846 ms, 24 alloc, 43.6 MiB
@benchmark intersectalajulia(a, b)      # min = 77 ms, 50 alloc, 18.1 MiB

FredericN · December 28, 2021, 8:59am

Thanks for your solution. But my problem is that with the following input:
a =[3 , 45 , 123 , 12]
b = [12 , 19 , 46 , 56 , 123]

the outputs are not the same:
julia> intersectalajulia(a,b)
([123, 12], [3, 4], [1, 5])

julia> intersectalamatlab( a , b )
([123, 12], [3, 4], [5, 1])

I want to keep track that a[3] = b[5] and a[4] = b[1]. Your solution is faster but does not provide the output I am looking for.

rafael.guerra · December 28, 2021, 9:32am

OK, the intent was not clear in your OP.

I think Julia’s function indexin() can be added to achieve what you want and the original speed is barely affected:

function intersectalajulia2(a,b)
    ia = findall(in(b), a)
    ib = findall(in(view(a,ia)), b)
    return unique(view(a,ia)), ia, ib[indexin(view(a,ia), view(b,ib))]
end

Run several random tests (for non-repeated elements) and the equality was always true:

using BenchmarkTools, StatsBase
a = sample(1:rand(30:100), rand(1:20), replace=false);
b = sample(1:rand(30:100), rand(1:20), replace=false);
intersectalamatlab(a, b) == intersectalajulia2(a, b)  # true

FredericN · December 28, 2021, 10:49am

Dear Rafael,

many thanks for the answer but at least with Julia 1.6 I have problems with the following test:

a = rand(1:1_000, 1_000);
b = rand(1:1_000, 1_000);
intersectalajulia2(a,b)

I get one over five times (say) an error message of the type:
ERROR: BoundsError: attempt to access 615-element Vector{Int64} at index [Union{Nothing, Int64}[566, 129, 459, 110, 269, 372, 261, 265, 76, 137 … 42, 327, 201, 314, 16, 559, 97, 201, 609, 36]]
Stacktrace:
[1] throw_boundserror(A::Vector{Int64}, I::Tuple{Vector{Union{Nothing, Int64}}})
@ Base ./abstractarray.jl:651
[2] checkbounds
@ ./abstractarray.jl:616 [inlined]
[3] _getindex
@ ./multidimensional.jl:831 [inlined]
[4] getindex
@ ./abstractarray.jl:1170 [inlined]
[5] intersectalajulia2(a::Vector{Int64}, b::Vector{Int64})
@ Main ./none:4
[6] top-level scope
@ none:1

Topic		Replies	Views
Indexin() for two identical vectors New to Julia	2	491	August 13, 2018
Find position of Array elements in another Array General Usage indexing , arrays	16	3608	February 1, 2023
Find indices meeting multiple criteria searching from multiple arrays General Usage arrays	6	2589	October 23, 2021
Questions of vector New to Julia question , arrays	9	325	September 22, 2022
Hand written loop slower than bitVector broadcast Performance question	15	725	January 18, 2024

Indices of intersection of two arrays

Related topics