Fast ways to check if an element is in a vector of vector of elements

Hello ,
What would be the most efficient way to look for a number in a vector of vector of numbers?
i.e.
Data: t = [ [1,2,3] , [4,5,6] , [1,10,12], [1,4,5] ]
Input: 1
Output: 1,3,4

I could do something like findall(map(x->1 in x, t)) but this is quite slow, and I’m pretty sure it’s not the best way to do this.
Thanks!

Make it a set, or give it some ordering.

Thanks, but could you also please tell me how this would help?

findall(v->in(1, v), t)

or

findall(1 in v for v in t)
1 Like

I’m interested in an answer, too.
The following didn’t speed up the searching:

t_set = map(Set, t)
ix = findall(map(x->1 in x, t_set))

neither

t_sort = map(sort, t)
function fsn_loop(v, n)
    # assume that the elements in v are sorted
    hits = Int[]
    for (i, v_l) in enumerate(v)
        determined = false        
        for v_ll in v_l
            if v_ll == n
                push!(hits, i)
                determined = true
                break
            end
            if v_ll > n
                determined = true
                break
            end
        end
        if determined
            continue
        end
    end
    return hits    
end
ix = fsn_loop(t_sort, 1) 

For t having 1e5 elements, @btime gives around 1 ms, as does DNF’s solution.

My previous experience tells me that this should be the fastest:

But I must say I have seen a number of strange performance issues in v1.11, and now it benchmarks as slower than findall(map(x->1 in x, t)), which, frankly, makes no sense to me, since the latter creates a redundant temporary array and also passes twice over memory :man_shrugging:

Hm, okay, thank you!

since the latter creates a redundant temporary array and also passes twice over memory :man_shrugging:

I thought there would be a better way, but I guess I’m better off sticking to findall(map(x->1 in x, t)) for now in that case.

So I have not understood the problem first. I think the big question is, how frequently you want to run this search. If once, the answer by @DNF is OK. if multiple times, you should build the index. Like

a = [[1,2,3],[4,5,6],[1,5,6]]
index = Dict{Int,Vector{Int}}()
for (i, jj) in enumerate(a)
       for j in jj
       push!(get!(index, j, Int[]), i)
       end
 end

julia> index[1]
2-element Vector{Int64}:
 1
 3
3 Likes

I would give this version a chance too

function find1s(a)
    r=Int64[]
    for i in eachindex(a)
    if !isnothing(findfirst(==(1),a[i]))
        push!(r,i)
    end
    end
    r
end
2 Likes

My package SmallCollections.jl contains a vectorized version of in for suitable types (not yet in the published version).

EDIT: It’s also vectorized for SVector in StaticArrays.jl.

This might speed things up if the element vectors are “small” (say, up to 32 or 64 elements). For example, for

using SmallCollections, Chairmarks
using Base: Fix1

T = Int32
N = 3
t = [rand(T, N) for _ in 1:1_000_000]
x = T(1)

I get

julia> @b t findall(Fix1(in, $x), _)   # analogous to OP
5.546 ms (5 allocs: 122.250 KiB)

julia> @b map(FixedVector{N}, t) findall(Fix1(in, $x), _)
881.295 μs (5 allocs: 122.250 KiB)

julia> @b map(SmallVector{N}, t) findall(Fix1(in, $x), _)
3.204 ms (5 allocs: 122.250 KiB)

FixedVector{N,T} is like SVector{N,T} from StaticArrays.jl. SmallVector{N,T} can hold up to N elements of type T.

To try it out, you can install the relevant branch via

pkg> add https://github.com/matthias314/SmallCollections.jl#fixedvector

EDIT: fast in for SmallVector is now implemented.

1 Like

could you check this ?

function f1(a,e)
    r=Vector{Int}(undef,length(a))
    j=0
    for i in eachindex(a)
          !isnothing(findfirst(==(e),a[i])) && (r[j+=1]=i)
    end
    resize!(r,j) 
end
1 Like

Was this for me? I get (with the same t and x as before)

julia> @b t f1(_, $x)
4.813 ms (3 allocs: 7.629 MiB)

julia> @b map(FixedVector{N}, t) f1(_, $x)
2.299 ms (3 allocs: 7.629 MiB)

Sorry.
Yes is for you.
I meant to do the proof by redefining the vector of vectors in the following way

T = Int32
N = 3
t = [T.(rand(1:10^5, N)) for _ in 1:1_000_000]
st=SArray{Tuple{N}}.(t)

Here it is:

julia> @b f1($st, $x)
1.717 ms (3 allocs: 7.629 MiB)

julia> @b map(FixedVector{N}, st) findall(Fix1(in, $x), _)
902.658 μs (6 allocs: 122.531 KiB)
1 Like

f1 using findfirst -assuming that there is only one value being searched for or that it is enough to find the first one- would become more effective for vectors a little longer than 3.
Could you try for N=32?

As in previous post, just with N = 32:

julia> @b f1($st, $x)
14.917 ms (3 allocs: 7.629 MiB)

julia> @b map(FixedVector{N}, st) findall(Fix1(in, $x), _)
65.604 ms (7 allocs: 124.781 KiB)

julia> @b map(collect, st) findall(Fix1(in, $x), _)
26.547 ms (7 allocs: 124.781 KiB)

Now my version is much slower, even slower than findall with Vector. I don’t understand this because in is faster for FixedVector:

julia> w = st[1]; @b $x in $w
4.290 ns

julia> @b FixedVector{N}(w) $x in _
2.711 ns

julia> @b collect(w) $x in _
15.173 ns

Using findfirst looks slower:

julia> @b findfirst(==($x), $w) === nothing
13.547 ns

With in instead of findfirst, f1 becomes even faster (N = 32):

julia> @b f1($st, $x)
15.493 ms (3 allocs: 7.629 MiB)

julia> @b f1_in($st, $x)
6.463 ms (3 allocs: 7.629 MiB)

julia> @b map(FixedVector{N}, st) f1_in(_, $x)
6.782 ms (3 allocs: 7.629 MiB)

where

function f1_in(a,e)
    r=Vector{Int}(undef,length(a))
    j=0
    for i in eachindex(a)
        if e in a[i]   # !isnothing(findfirst(==(e),a[i]))
            r[j+=1]=i
        end
    end
    resize!(r,j) 
end

The problem seems to be findall. With N = 32, T = Int32, x = T(1) and

t = [T.(rand(1:10^5, N)) for _ in 1:1_000_000]
st = map(SVector{N}, t)

(as before), I get

julia> @b findall(Fix1(in, $x), $st)
68.203 ms (7 allocs: 124.781 KiB)

julia> @b [i for (i, w) in enumerate($st) if $x in w]
6.780 ms (7 allocs: 7.562 KiB)

EDIT: Also

julia> t2 = map(SmallVector{N}, t);
julia> @b [i for (i, w) in enumerate($t2) if $x in w]
7.595 ms (7 allocs: 7.562 KiB)
1 Like

It seems that the implementation of some method of the function in is able to exploit the fact of having a static array better than findfirst can do.
it would be interesting to have a documentation of functions like these that at first glance seem to do the same thing (at least from a (high?) ​​"logical" point of view), that explains what algorithm (algorithms?) they use in the various cases and when one can be “convenient” compared to the other.
A curiosity of a similar kind comes to me from the fact that the use of enumerate that makes available both the index and the value of an array is slower than the following version where instead from time to time, having only the index, you have to obtain the value of the array element (in this case in turn an array).

julia> T = Int32
Int32

julia> N = 32
32

julia> t = [T.(rand(1:10^5, N)) for _ in 1:1_000_000];

julia> st=SArray{Tuple{N}}.(t);

julia> x=T(1)
1

julia> @b [i for (i, w) in enumerate($st) if $x in w]
16.229 ms (6 allocs: 7.609 KiB)

julia> @b [i for i in eachindex($st) if $x in $st[i]]
7.199 ms (6 allocs: 7.609 KiB)
1 Like

This is because the implementation of in for SVector can be vectorized while that of findfirst (the default method for AbstractArray) cannot. However, findfirst for FixedVector and SmallVector is vectorized (for suitable types), and in fact in for SmallVector is defined as

in(x, v::AbstractSmallVector) = findfirst(==(x), v) !== nothing

I don’t find it surprising that enumerate is slower than eachindex. With Julia 1.11.0, the difference is quite small on my machine:

julia> @b [i for (i, w) in enumerate($st) if $x in w]
6.792 ms (7 allocs: 7.562 KiB)

julia> @b [i for i in eachindex($st) if $x in $st[i]]
6.035 ms (7 allocs: 7.562 KiB)