Hello ,
What would be the most efficient way to look for a number in a vector of vector of numbers?
i.e.
Data: t = [ [1,2,3] , [4,5,6] , [1,10,12], [1,4,5] ]
Input: 1
Output: 1,3,4
I could do something like findall(map(x->1 in x, t)) but this is quite slow, and I’m pretty sure it’s not the best way to do this.
Thanks!
I’m interested in an answer, too.
The following didn’t speed up the searching:
t_set = map(Set, t)
ix = findall(map(x->1 in x, t_set))
neither
t_sort = map(sort, t)
function fsn_loop(v, n)
# assume that the elements in v are sorted
hits = Int[]
for (i, v_l) in enumerate(v)
determined = false
for v_ll in v_l
if v_ll == n
push!(hits, i)
determined = true
break
end
if v_ll > n
determined = true
break
end
end
if determined
continue
end
end
return hits
end
ix = fsn_loop(t_sort, 1)
For t having 1e5 elements, @btime gives around 1 ms, as does DNF’s solution.
My previous experience tells me that this should be the fastest:
But I must say I have seen a number of strange performance issues in v1.11, and now it benchmarks as slower than findall(map(x->1 in x, t)), which, frankly, makes no sense to me, since the latter creates a redundant temporary array and also passes twice over memory
So I have not understood the problem first. I think the big question is, how frequently you want to run this search. If once, the answer by @DNF is OK. if multiple times, you should build the index. Like
a = [[1,2,3],[4,5,6],[1,5,6]]
index = Dict{Int,Vector{Int}}()
for (i, jj) in enumerate(a)
for j in jj
push!(get!(index, j, Int[]), i)
end
end
julia> index[1]
2-element Vector{Int64}:
1
3
f1 using findfirst -assuming that there is only one value being searched for or that it is enough to find the first one- would become more effective for vectors a little longer than 3.
Could you try for N=32?
function f1_in(a,e)
r=Vector{Int}(undef,length(a))
j=0
for i in eachindex(a)
if e in a[i] # !isnothing(findfirst(==(e),a[i]))
r[j+=1]=i
end
end
resize!(r,j)
end
The problem seems to be findall. With N = 32, T = Int32, x = T(1) and
t = [T.(rand(1:10^5, N)) for _ in 1:1_000_000]
st = map(SVector{N}, t)
(as before), I get
julia> @b findall(Fix1(in, $x), $st)
68.203 ms (7 allocs: 124.781 KiB)
julia> @b [i for (i, w) in enumerate($st) if $x in w]
6.780 ms (7 allocs: 7.562 KiB)
EDIT: Also
julia> t2 = map(SmallVector{N}, t);
julia> @b [i for (i, w) in enumerate($t2) if $x in w]
7.595 ms (7 allocs: 7.562 KiB)
It seems that the implementation of some method of the function in is able to exploit the fact of having a static array better than findfirst can do.
it would be interesting to have a documentation of functions like these that at first glance seem to do the same thing (at least from a (high?) "logical" point of view), that explains what algorithm (algorithms?) they use in the various cases and when one can be “convenient” compared to the other.
A curiosity of a similar kind comes to me from the fact that the use of enumerate that makes available both the index and the value of an array is slower than the following version where instead from time to time, having only the index, you have to obtain the value of the array element (in this case in turn an array).
julia> T = Int32
Int32
julia> N = 32
32
julia> t = [T.(rand(1:10^5, N)) for _ in 1:1_000_000];
julia> st=SArray{Tuple{N}}.(t);
julia> x=T(1)
1
julia> @b [i for (i, w) in enumerate($st) if $x in w]
16.229 ms (6 allocs: 7.609 KiB)
julia> @b [i for i in eachindex($st) if $x in $st[i]]
7.199 ms (6 allocs: 7.609 KiB)
This is because the implementation of in for SVector can be vectorized while that of findfirst (the default method for AbstractArray) cannot. However, findfirst for FixedVector and SmallVector is vectorized (for suitable types), and in fact in for SmallVector is defined as
I don’t find it surprising that enumerate is slower than eachindex. With Julia 1.11.0, the difference is quite small on my machine:
julia> @b [i for (i, w) in enumerate($st) if $x in w]
6.792 ms (7 allocs: 7.562 KiB)
julia> @b [i for i in eachindex($st) if $x in $st[i]]
6.035 ms (7 allocs: 7.562 KiB)