Find position on list (fast)

programista · November 22, 2021, 7:38pm

Me vectors are very long and I am looking for efficient way to :
If
x=[“a”;“b”;“c”;“d”;“e”]
and
y=[“b”; “a”; “a”; “c”]

How to find , but very fast, position any y on x
expected result:

2
1
1
3

Thanks Pauls

carstenbauer · November 22, 2021, 8:00pm

Well, very fast is relative. The straightforward solution isn’t shabby

julia> using BenchmarkTools

julia> f1(x,y) = [findfirst(isequal(el), x) for el in y]
f1 (generic function with 1 method)

julia> @btime f1($x,$y);
  108.864 ns (1 allocation: 96 bytes)

Depending on how much you (over-)simplified your MWE, you could even use Chars instead of Strings to get a free speedup

julia> xc = collect('a':'e');

julia> yc = ['b', 'a','a', 'c'];

julia> @btime f1($xc,$yc);
  71.986 ns (1 allocation: 96 bytes)

BTW, a similar alternative is

julia> f2(x,y) = [searchsortedfirst(x, el) for el in y]
f2 (generic function with 1 method)

julia> @btime f2($x,$y);
  146.619 ns (1 allocation: 96 bytes)

julia> @btime f2($xc,$yc);
  71.476 ns (1 allocation: 96 bytes)

goerch · November 22, 2021, 8:18pm

In case the MWE is oversimplified and the vectors are really big, I found a q&a on stackoverflow.com. And then there is certainly more of this.

cjdoris · November 22, 2021, 8:52pm

If x is very large it will be more efficient to create a Dict mapping each element of x to its index.

carstenbauer · November 22, 2021, 9:11pm

Just to demonstrate @cjdoris’s great point:

julia> x = string.(Char.(1:1000));

julia> d = Dict(zip(x,1:length(x)));

julia> @btime f1($x,$y);
  2.621 μs (1 allocation: 96 bytes)

julia> f3(d,y) = [d[el] for el in y]
f3 (generic function with 1 method)

julia> @btime f3($d,$y);
  170.077 ns (1 allocation: 96 bytes)

Constant lookup is awesome

juliohm · November 22, 2021, 9:16pm

Remember that Julia has a built-in function for that, it is called indexin:

julia> x = ["a", "b", "c", "d", "e"]
5-element Vector{String}:
 "a"
 "b"
 "c"
 "d"
 "e"

julia> y = ["b", "a", "a", "c"]
4-element Vector{String}:
 "b"
 "a"
 "a"
 "c"

julia> indexin(y, x)
4-element Vector{Union{Nothing, Int64}}:
 2
 1
 1
 3

It may not be the fastest, but clean and easy to use.

Oscar_Smith · November 22, 2021, 9:17pm

If it’s not the fastest, file an issue (or make a PR). It should be.

carstenbauer · November 22, 2021, 9:20pm

How could I forget about it For comparison with the other variants (on my machine):

julia> f4(x,y) = indexin(y, x)
f4 (generic function with 1 method)

julia> @btime f4($x,$y);
  491.732 ns (5 allocations: 624 bytes)

Definitely not the fastest!

Oscar_Smith · November 22, 2021, 9:28pm

It does basically the same thing as f3. The big difference is that your timing of f3 didn’t include the time to make the dictionary. (also the way indexin creates the dictionary isn’t especially efficient).

carstenbauer · November 22, 2021, 9:36pm

Fair enough. So it’s prioritizing the case of a large haystack. Would it perhaps make sense to decide on a strategy based on the length of the latter, i.e. use a polyalgorithm? Or do we just accept it (and only optimize the dict creation im possible)?

Oscar_Smith · November 22, 2021, 9:40pm

Probably worth a PR.

rafael.guerra · November 22, 2021, 9:57pm

FWIW, for the inputs:

x = string.(Char.(1:1000))
y = string.(rand(Char.(1:1000), 500))

@carstenbauer’s f3(x,y), with the dictionary built-in, takes about 2/3 of the time of Base’s indexin(x,y)

stevengj · November 22, 2021, 11:20pm

If these are operations on big arrays and performance matters, you should really think about changing to a data structure that supports more efficient searching. Otherwise you are stuck with O(length(x)×length(y)) complexity. Like:

But ideally you should start with a data structure that supports the operations you want quickly, rather than starting with some other data structure and then converting (or constructing auxiliary dictionaries).

stevengj · November 22, 2021, 11:24pm

carstenbauer:

How could I forget about it For comparison with the other variants (on my machine):
julia> f4(x,y) = indexin(y, x)
f4 (generic function with 1 method)

julia> @btime f4($x,$y);
  491.732 ns (5 allocations: 624 bytes)
Definitely not the fastest!

Not a fair comparison, because the indexin function creates a dictionary mapping elements to indices, so when you apply @btime to indexin you are including the time to create the dictionary. In contrast, in your f3 benchmark you did not include the time to construct the Dict.

(Note also that indexin returns an array of Union{Nothing,Int} since it allows for the possibility that an element is missing from x, whereas your f3 code threw an error in this case, and hence didn’t have to pay the price of a type-unstable array.)

Topic		Replies	Views
Finding the position in array New to Julia dataframes	8	770	February 15, 2022
Find position of Array elements in another Array General Usage indexing , arrays	16	3458	February 1, 2023
Finding Position of Element in an Array General Usage question	46	102518	October 9, 2019
Fast in.(x, Ref(y)) Performance	4	510	May 18, 2020
Findall slow General Usage	8	1885	October 24, 2019

Find position on list (fast)

Related topics