Julian way of MATLAB ismember function?

Let two arrays a = ["1", "4", "7"] and b = ["1","2","3,"4","5"].

In MATLAB, [Lia,Locb] = ismember(a, b) will lead to

Lia = [1, 1, 0]
Locb = [1, 4, 0]

Quickly looking over docs, I coudn’t find a Julian equivalent of ismember.
Instead I can think two alternative ways.

# Way 1: Using comprehending lists
Lia = [each_item in b for each_item in a]
Locb = [findfirst(b, each_item) for each_item in a]

# Way 2: Vectorization (edited)
Lia = broadcast(x -> in(x, b), a)
Locb = broadcast(x -> findfirst(b, x), a)

Any better ways?

1 Like

If you want it to be short you can do

Lia = in.(a, [b])
LocB = findfirst.([b], a)

In practice I’d look at what the results of ismember are being used for. It’s quite possible that in Julia you wouldn’t need to form those vectors explicitly.

4 Likes

I found that there is an indexin function. It’s not exactly same as matlab, but…

indexin(a, b)

  Returns a vector containing the highest index in b for each value in a that is a member of b . The output vector
  contains 0 wherever a is not a member of b.

  julia> a = ['a', 'b', 'c', 'b', 'd', 'a'];

  julia> b = ['a','b','c'];

  julia> indexin(a,b)
  6-element Array{Int64,1}:
   1
   2
   3
   2
   0
   1

  julia> indexin(b,a)
  3-element Array{Int64,1}:
   6
   4
   3
1 Like

As wonjinChoi said, indexin should do the heavy lifting. And Lia can be calculated from LocB to save time. In code:
LocB = indexin(a,b) Lia = 1.*(LocB .!= 0)

this gives the following:

julia> a = ['1', '4', '7']; b = ['1','2','3','4','5'];

julia> LocB = indexin(a,b)
3-element Array{Int64,1}:
 1
 4
 0

julia> Lia = 1.*(LocB .!= 0)
3-element Array{Int64,1}:
 1
 1
 0

Thank you for all suggestions.

@GunnarFarneback I’m porting a MATLAB toolbox (COBRA Toolbox; https://github.com/opencobra/cobratoolbox/) into Julia as a practice. The whole story is too long… In this toolbox, this function is employed to check whether a newly added row(s) (metabolites) already exists in the sparse matrix.

Is it just by accident that your two input arrays in your example are sorted? If you can guarantee sort order of your inputs then typically you can solve this problem much faster that using in, indexin, or findfirst (or ismember in Matlab). For the case where one of your input arrays is very small relative to the other you can use searchsorted on each element of the small array, so you get k calls to an O(ln(n)) algorithm. For the case where both input arrays are of similar size you can use a simultaneous single loop over both arrays, hence the algorithm is O(n). There are probably several examples of this algorithm online but if you want one in Julia I have it so let me know and I’ll post it here.

Ahh, that was just by accident. Elements of the array do not need to be sorted.
I learned some basics of algorithms and Big-Oh notations but have almost forgotten :slight_smile:
If you would show me some examples, I’d really appreciate it.

There’s no way I’ll ever manage to give a better explanation of Big-Oh notation than you’ll find here:

FYI, given that you can’t guarantee sort order of inputs, the answers provided by others here are as good as you can get (they’ll run in O(n*ln(n)))

Cheers,

Colin

Actually, indexin(a, b) return the highest index in b for each value in a that is a member of b. while ismember in Matlab return the lowest index, which may cause different results unintended.

If using Lia = in.(a, [b]), does it use TimSort to keep within O(n) if the vectors are already sorted? Is TimSort the “default” algorithm for vector methods that require sorting?

(Note that searchsorted doesn’t handle searching for members of a vector within another vector in the same initially requested.)

There’s no sorting taking place. Given vectors a and b,

in.(a, [b])

is equivalent to

for x = a
    in(x, b)
end

which in turn is equivalent to

for x = a
    in = false
    for y = b
        (in = x == y) && break
    end
end

In other words, the complexity is O(nm) where n is the length of vector a and m is the length of vector b.

As stated above, if the vectors are already sorted, you can do much better, but as far as I know there’s no built-in support for that in Julia, in which case you’d have to implement it yourself or look for an existing package.

2 Likes

Thanks for the response! And the really clear explanation of in.(a,[b]) and how it breaks down to in().

For strings arrays use the following:

 A= ["Hey","Hello","Hi"];
 B= ["Hey","Hi","Goodbye","Bye"];
boolianIndex=occursin.(A,B); 

This does not work in version 1.1.0

julia> A= ["Hey","Hello","Hi"];

julia> B= ["Hey","Hi","Goodbye","Bye"];

julia> boolianIndex=occursin.(A,B); 
ERROR: DimensionMismatch("arrays could not be broadcast to a common size")
Stacktrace:
 [1] _bcs1 at ./broadcast.jl:438 [inlined]
 [2] _bcs at ./broadcast.jl:432 [inlined]
 [3] broadcast_shape at ./broadcast.jl:426 [inlined]
 [4] combine_axes at ./broadcast.jl:421 [inlined]
 [5] instantiate at ./broadcast.jl:255 [inlined]
 [6] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(occursin),Tuple{Array{String,1},Array{String,1}}}) at ./broadcast.jl:753
 [7] top-level scope at none:0
1 Like

Thats correct, occursin is for checking whether one string occurs in another string, like occursin("Hey", "HeyHo").
If you want to check whether the strings in Array A are in Array B, you can still use in.(A, Ref(B)).
Or if you want the index, indexin(A,B).

2 Likes

Oh you are right, I wrote this example without testing it.
One of the matrices should be only one string.
Try the following:

A= ["Hey"];
B= ["Hey","Hi","Goodbye","Bye"];
boolianIndex=occursin.(A,B); 

Thank you! in.(A, Ref(B)) works perfectly.

If B is more than a handful of elements, I highly recommend:

in.(A, Ref(Set(B)))

as that’ll enable quick lookups of elements in B instead of repeatedly linearly scanning through the entire thing.

3 Likes