Me vectors are very long and I am looking for efficient way to :
If
x=[“a”;“b”;“c”;“d”;“e”]
and
y=[“b”; “a”; “a”; “c”]
How to find , but very fast, position any y on x
expected result:
2
1
1
3
Thanks Pauls
Me vectors are very long and I am looking for efficient way to :
If
x=[“a”;“b”;“c”;“d”;“e”]
and
y=[“b”; “a”; “a”; “c”]
How to find , but very fast, position any y on x
expected result:
2
1
1
3
Thanks Pauls
Well, very fast is relative. The straightforward solution isn’t shabby
julia> using BenchmarkTools
julia> f1(x,y) = [findfirst(isequal(el), x) for el in y]
f1 (generic function with 1 method)
julia> @btime f1($x,$y);
108.864 ns (1 allocation: 96 bytes)
Depending on how much you (over-)simplified your MWE, you could even use Char
s instead of String
s to get a free speedup
julia> xc = collect('a':'e');
julia> yc = ['b', 'a','a', 'c'];
julia> @btime f1($xc,$yc);
71.986 ns (1 allocation: 96 bytes)
BTW, a similar alternative is
julia> f2(x,y) = [searchsortedfirst(x, el) for el in y]
f2 (generic function with 1 method)
julia> @btime f2($x,$y);
146.619 ns (1 allocation: 96 bytes)
julia> @btime f2($xc,$yc);
71.476 ns (1 allocation: 96 bytes)
In case the MWE is oversimplified and the vectors are really big, I found a q&a on stackoverflow.com. And then there is certainly more of this.
If x
is very large it will be more efficient to create a Dict
mapping each element of x
to its index.
Just to demonstrate @cjdoris’s great point:
julia> x = string.(Char.(1:1000));
julia> d = Dict(zip(x,1:length(x)));
julia> @btime f1($x,$y);
2.621 μs (1 allocation: 96 bytes)
julia> f3(d,y) = [d[el] for el in y]
f3 (generic function with 1 method)
julia> @btime f3($d,$y);
170.077 ns (1 allocation: 96 bytes)
Constant lookup is awesome
Remember that Julia has a built-in function for that, it is called indexin
:
julia> x = ["a", "b", "c", "d", "e"]
5-element Vector{String}:
"a"
"b"
"c"
"d"
"e"
julia> y = ["b", "a", "a", "c"]
4-element Vector{String}:
"b"
"a"
"a"
"c"
julia> indexin(y, x)
4-element Vector{Union{Nothing, Int64}}:
2
1
1
3
It may not be the fastest, but clean and easy to use.
If it’s not the fastest, file an issue (or make a PR). It should be.
How could I forget about it For comparison with the other variants (on my machine):
julia> f4(x,y) = indexin(y, x)
f4 (generic function with 1 method)
julia> @btime f4($x,$y);
491.732 ns (5 allocations: 624 bytes)
Definitely not the fastest!
It does basically the same thing as f3
. The big difference is that your timing of f3
didn’t include the time to make the dictionary. (also the way indexin
creates the dictionary isn’t especially efficient).
Fair enough. So it’s prioritizing the case of a large haystack. Would it perhaps make sense to decide on a strategy based on the length of the latter, i.e. use a polyalgorithm? Or do we just accept it (and only optimize the dict creation im possible)?
Probably worth a PR.
FWIW, for the inputs:
x = string.(Char.(1:1000))
y = string.(rand(Char.(1:1000), 500))
@carstenbauer’s f3(x,y)
, with the dictionary built-in, takes about 2/3 of the time of Base’s indexin(x,y)
If these are operations on big arrays and performance matters, you should really think about changing to a data structure that supports more efficient searching. Otherwise you are stuck with O(length(x)×length(y)) complexity. Like:
But ideally you should start with a data structure that supports the operations you want quickly, rather than starting with some other data structure and then converting (or constructing auxiliary dictionaries).
Not a fair comparison, because the indexin
function creates a dictionary mapping elements to indices, so when you apply @btime
to indexin
you are including the time to create the dictionary. In contrast, in your f3
benchmark you did not include the time to construct the Dict
.
(Note also that indexin
returns an array of Union{Nothing,Int}
since it allows for the possibility that an element is missing from x
, whereas your f3
code threw an error in this case, and hence didn’t have to pay the price of a type-unstable array.)