Find position on list (fast)

Me vectors are very long and I am looking for efficient way to :
If
x=[“a”;“b”;“c”;“d”;“e”]
and
y=[“b”; “a”; “a”; “c”]

How to find , but very fast, position any y on x
expected result:

2
1
1
3

Thanks Pauls

Well, very fast is relative. The straightforward solution isn’t shabby :slight_smile:

julia> using BenchmarkTools

julia> f1(x,y) = [findfirst(isequal(el), x) for el in y]
f1 (generic function with 1 method)

julia> @btime f1($x,$y);
  108.864 ns (1 allocation: 96 bytes)

Depending on how much you (over-)simplified your MWE, you could even use Chars instead of Strings to get a free speedup :slight_smile:

julia> xc = collect('a':'e');

julia> yc = ['b', 'a','a', 'c'];

julia> @btime f1($xc,$yc);
  71.986 ns (1 allocation: 96 bytes)

BTW, a similar alternative is

julia> f2(x,y) = [searchsortedfirst(x, el) for el in y]
f2 (generic function with 1 method)

julia> @btime f2($x,$y);
  146.619 ns (1 allocation: 96 bytes)

julia> @btime f2($xc,$yc);
  71.476 ns (1 allocation: 96 bytes)
3 Likes

In case the MWE is oversimplified and the vectors are really big, I found a q&a on stackoverflow.com. And then there is certainly more of this.

If x is very large it will be more efficient to create a Dict mapping each element of x to its index.

4 Likes

Just to demonstrate @cjdoris’s great point:

julia> x = string.(Char.(1:1000));

julia> d = Dict(zip(x,1:length(x)));

julia> @btime f1($x,$y);
  2.621 μs (1 allocation: 96 bytes)

julia> f3(d,y) = [d[el] for el in y]
f3 (generic function with 1 method)

julia> @btime f3($d,$y);
  170.077 ns (1 allocation: 96 bytes)

Constant lookup is awesome :slight_smile:

1 Like

Remember that Julia has a built-in function for that, it is called indexin:

julia> x = ["a", "b", "c", "d", "e"]
5-element Vector{String}:
 "a"
 "b"
 "c"
 "d"
 "e"

julia> y = ["b", "a", "a", "c"]
4-element Vector{String}:
 "b"
 "a"
 "a"
 "c"

julia> indexin(y, x)
4-element Vector{Union{Nothing, Int64}}:
 2
 1
 1
 3

It may not be the fastest, but clean and easy to use.

7 Likes

If it’s not the fastest, file an issue (or make a PR). It should be.

2 Likes

How could I forget about it :slight_smile: For comparison with the other variants (on my machine):

julia> f4(x,y) = indexin(y, x)
f4 (generic function with 1 method)

julia> @btime f4($x,$y);
  491.732 ns (5 allocations: 624 bytes)

Definitely not the fastest!

1 Like

It does basically the same thing as f3. The big difference is that your timing of f3 didn’t include the time to make the dictionary. (also the way indexin creates the dictionary isn’t especially efficient).

3 Likes

Fair enough. So it’s prioritizing the case of a large haystack. Would it perhaps make sense to decide on a strategy based on the length of the latter, i.e. use a polyalgorithm? Or do we just accept it (and only optimize the dict creation im possible)?

Probably worth a PR.

FWIW, for the inputs:

x = string.(Char.(1:1000))
y = string.(rand(Char.(1:1000), 500))

@carstenbauer’s f3(x,y), with the dictionary built-in, takes about 2/3 of the time of Base’s indexin(x,y)

2 Likes

If these are operations on big arrays and performance matters, you should really think about changing to a data structure that supports more efficient searching. Otherwise you are stuck with O(length(x)×length(y)) complexity. Like:

But ideally you should start with a data structure that supports the operations you want quickly, rather than starting with some other data structure and then converting (or constructing auxiliary dictionaries).

2 Likes

Not a fair comparison, because the indexin function creates a dictionary mapping elements to indices, so when you apply @btime to indexin you are including the time to create the dictionary. In contrast, in your f3 benchmark you did not include the time to construct the Dict.

(Note also that indexin returns an array of Union{Nothing,Int} since it allows for the possibility that an element is missing from x, whereas your f3 code threw an error in this case, and hence didn’t have to pay the price of a type-unstable array.)

2 Likes