Me vectors are very long and I am looking for efficient way to :

If

x=[“a”;“b”;“c”;“d”;“e”]

and

y=[“b”; “a”; “a”; “c”]

How to find , but very fast, position any y on x

expected result:

2

1

1

3

Thanks Pauls

Me vectors are very long and I am looking for efficient way to :

If

x=[“a”;“b”;“c”;“d”;“e”]

and

y=[“b”; “a”; “a”; “c”]

How to find , but very fast, position any y on x

expected result:

2

1

1

3

Thanks Pauls

Well, very fast is relative. The straightforward solution isn’t shabby

```
julia> using BenchmarkTools
julia> f1(x,y) = [findfirst(isequal(el), x) for el in y]
f1 (generic function with 1 method)
julia> @btime f1($x,$y);
108.864 ns (1 allocation: 96 bytes)
```

Depending on how much you (over-)simplified your MWE, you could even use `Char`

s instead of `String`

s to get a free speedup

```
julia> xc = collect('a':'e');
julia> yc = ['b', 'a','a', 'c'];
julia> @btime f1($xc,$yc);
71.986 ns (1 allocation: 96 bytes)
```

BTW, a similar alternative is

```
julia> f2(x,y) = [searchsortedfirst(x, el) for el in y]
f2 (generic function with 1 method)
julia> @btime f2($x,$y);
146.619 ns (1 allocation: 96 bytes)
julia> @btime f2($xc,$yc);
71.476 ns (1 allocation: 96 bytes)
```

3 Likes

In case the MWE is oversimplified and the vectors are really big, I found a q&a on stackoverflow.com. And then there is certainly more of this.

If `x`

is very large it will be more efficient to create a `Dict`

mapping each element of `x`

to its index.

4 Likes

Just to demonstrate @cjdoris’s great point:

```
julia> x = string.(Char.(1:1000));
julia> d = Dict(zip(x,1:length(x)));
julia> @btime f1($x,$y);
2.621 μs (1 allocation: 96 bytes)
julia> f3(d,y) = [d[el] for el in y]
f3 (generic function with 1 method)
julia> @btime f3($d,$y);
170.077 ns (1 allocation: 96 bytes)
```

Constant lookup is awesome

1 Like

Remember that Julia has a built-in function for that, it is called `indexin`

:

```
julia> x = ["a", "b", "c", "d", "e"]
5-element Vector{String}:
"a"
"b"
"c"
"d"
"e"
julia> y = ["b", "a", "a", "c"]
4-element Vector{String}:
"b"
"a"
"a"
"c"
julia> indexin(y, x)
4-element Vector{Union{Nothing, Int64}}:
2
1
1
3
```

It may not be the fastest, but clean and easy to use.

7 Likes

If it’s not the fastest, file an issue (or make a PR). It should be.

2 Likes

How could I forget about it For comparison with the other variants (on my machine):

```
julia> f4(x,y) = indexin(y, x)
f4 (generic function with 1 method)
julia> @btime f4($x,$y);
491.732 ns (5 allocations: 624 bytes)
```

Definitely not the fastest!

1 Like

It does basically the same thing as `f3`

. The big difference is that your timing of `f3`

didn’t include the time to make the dictionary. (also the way `indexin`

creates the dictionary isn’t especially efficient).

3 Likes

Fair enough. So it’s prioritizing the case of a large haystack. Would it perhaps make sense to decide on a strategy based on the length of the latter, i.e. use a polyalgorithm? Or do we just accept it (and only optimize the dict creation im possible)?

Probably worth a PR.

FWIW, for the inputs:

```
x = string.(Char.(1:1000))
y = string.(rand(Char.(1:1000), 500))
```

@carstenbauer’s `f3(x,y)`

, with the dictionary built-in, takes about 2/3 of the time of Base’s `indexin(x,y)`

2 Likes

If these are operations on big arrays and performance matters, you should really think about changing to a data structure that supports more efficient searching. Otherwise you are stuck with O(length(x)×length(y)) complexity. Like:

But ideally you should *start* with a data structure that supports the operations you want quickly, rather than starting with some other data structure and then converting (or constructing auxiliary dictionaries).

2 Likes

Not a fair comparison, because the `indexin`

function creates a dictionary mapping elements to indices, so when you apply `@btime`

to `indexin`

you are including the time to create the dictionary. In contrast, in your `f3`

benchmark you did *not* include the time to construct the `Dict`

.

(Note also that `indexin`

returns an array of `Union{Nothing,Int}`

since it allows for the possibility that an element is missing from `x`

, whereas your `f3`

code threw an error in this case, and hence didn’t have to pay the price of a type-unstable array.)

2 Likes