Fast ways to check if an element is in a vector of vector of elements

rakshith95 · October 22, 2024, 3:29pm

Hello,
What would be the most efficient way to look for a number in a vector of vector of numbers?
i.e.
Data: t = [ [1,2,3] , [4,5,6] , [1,10,12], [1,4,5] ]
Input: 1
Output: 1,3,4

I could do something like findall(map(x->1 in x, t)) but this is quite slow, and I’m pretty sure it’s not the best way to do this.
Thanks!

Tomas_Pevny · October 22, 2024, 3:43pm

Make it a set, or give it some ordering.
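
As a rough illustration of what this suggestion could mean (a sketch only, not necessarily what was intended here): turning each inner vector into a Set makes a single membership test a hash lookup, and sorting each inner vector once allows a binary search with Base's insorted (available since Julia 1.6). The names t_sets and t_sorted are made up for this example.

```julia
# Data from the original question.
t = [[1, 2, 3], [4, 5, 6], [1, 10, 12], [1, 4, 5]]

# Option 1: one Set per inner vector; `in` becomes a hash lookup.
t_sets = map(Set, t)                       # hypothetical name
hits_set = findall(s -> 1 in s, t_sets)

# Option 2: sort each inner vector once; `insorted` does a binary search.
t_sorted = map(sort, t)                    # hypothetical name
hits_sorted = findall(v -> insorted(1, v), t_sorted)
```

Either preprocessing step costs time and memory up front, so it is likely to pay off only for long inner vectors or for many repeated queries.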

rakshith95 · October 22, 2024, 4:31pm

Thanks, but could you also please tell me how this would help?

DNF · October 22, 2024, 4:56pm

findall(v->in(1, v), t)

or

findall(1 in v for v in t)

zweiglimmergneis · October 22, 2024, 5:01pm

I’m interested in an answer, too.
The following didn’t speed up the searching:

t_set = map(Set, t)
ix = findall(map(x->1 in x, t_set))

neither

t_sort = map(sort, t)
function fsn_loop(v, n)
    # assume that the elements of each inner vector are sorted
    hits = Int[]
    for (i, v_l) in enumerate(v)
        for v_ll in v_l
            if v_ll == n
                push!(hits, i)
                break
            end
            # sorted input: once we are past n, it cannot occur later
            v_ll > n && break
        end
    end
    return hits
end
ix = fsn_loop(t_sort, 1)

For t having 1e5 elements, @btime gives around 1 ms, as does DNF’s solution.

DNF · October 22, 2024, 5:05pm

My previous experience tells me that this should be the fastest:

But I must say I have seen a number of strange performance issues in v1.11, and now it benchmarks as slower than findall(map(x->1 in x, t)). Frankly, that makes no sense to me, since the latter creates a redundant temporary array and also passes over memory twice.

rakshith95 · October 22, 2024, 5:42pm

Hm, okay, thank you!

since the latter creates a redundant temporary array and also passes twice over memory

I thought there would be a better way, but I guess I’m better off sticking to findall(map(x->1 in x, t)) for now in that case.

Tomas_Pevny · October 22, 2024, 7:19pm

I did not understand the problem at first. I think the big question is how frequently you want to run this search. If only once, the answer by @DNF is OK. If multiple times, you should build an index, like this:

a = [[1,2,3],[4,5,6],[1,5,6]]
index = Dict{Int,Vector{Int}}()
for (i, jj) in enumerate(a)
    for j in jj
        push!(get!(index, j, Int[]), i)
    end
end

julia> index[1]
2-element Vector{Int64}:
 1
 3
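
For repeated queries, the index idea above could be wrapped into two small helpers. This is only a sketch: the names build_index and lookup are hypothetical, and, unlike the loop above, it skips duplicate entries when a value occurs more than once in the same inner vector.

```julia
# Build a reverse index: value => indices of the inner vectors containing it.
function build_index(a::AbstractVector{<:AbstractVector{T}}) where {T}
    index = Dict{T,Vector{Int}}()
    for (i, inner) in enumerate(a)
        for x in inner
            hits = get!(() -> Int[], index, x)
            # Indices arrive in increasing order, so checking the last
            # entry is enough to avoid recording the same vector twice.
            if isempty(hits) || last(hits) != i
                push!(hits, i)
            end
        end
    end
    return index
end

# Look up a value; values that never occur give an empty result
# instead of a KeyError.
lookup(index, x) = get(index, x, Int[])

a = [[1, 2, 3], [4, 5, 6], [1, 5, 6]]
index = build_index(a)
r1 = lookup(index, 1)
r7 = lookup(index, 7)
```

Building the index touches every element once, so it only pays off when the same data is queried several times.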

rocco_sprmnt21 · October 22, 2024, 7:56pm

I would give this version a chance too

function find1s(a)
    r=Int64[]
    for i in eachindex(a)
        if !isnothing(findfirst(==(1),a[i]))
            push!(r,i)
        end
    end
    r
end

matthias314 · October 22, 2024, 8:32pm

My package SmallCollections.jl contains a vectorized version of in for suitable types (not yet in the published version).

EDIT: It’s also vectorized for SVector in StaticArrays.jl.

This might speed things up if the element vectors are “small” (say, up to 32 or 64 elements). For example, for

using SmallCollections, Chairmarks
using Base: Fix1

T = Int32
N = 3
t = [rand(T, N) for _ in 1:1_000_000]
x = T(1)

I get

julia> @b t findall(Fix1(in, $x), _)   # analogous to OP
5.546 ms (5 allocs: 122.250 KiB)

julia> @b map(FixedVector{N}, t) findall(Fix1(in, $x), _)
881.295 μs (5 allocs: 122.250 KiB)

julia> @b map(SmallVector{N}, t) findall(Fix1(in, $x), _)
3.204 ms (5 allocs: 122.250 KiB)

FixedVector{N,T} is like SVector{N,T} from StaticArrays.jl. SmallVector{N,T} can hold up to N elements of type T.

To try it out, you can install the relevant branch via

pkg> add https://github.com/matthias314/SmallCollections.jl#fixedvector

EDIT: fast in for SmallVector is now implemented.

rocco_sprmnt21 · October 22, 2024, 9:11pm

Could you check this?

function f1(a,e)
    r=Vector{Int}(undef,length(a))
    j=0
    for i in eachindex(a)
        !isnothing(findfirst(==(e),a[i])) && (r[j+=1]=i)
    end
    resize!(r,j)
end

matthias314 · October 22, 2024, 9:18pm

Was this for me? I get (with the same t and x as before)

julia> @b t f1(_, $x)
4.813 ms (3 allocs: 7.629 MiB)

julia> @b map(FixedVector{N}, t) f1(_, $x)
2.299 ms (3 allocs: 7.629 MiB)

rocco_sprmnt21 · October 22, 2024, 9:23pm

Sorry.
Yes, it was for you.
I meant to run the test with the vector of vectors redefined as follows:

using StaticArrays

T = Int32
N = 3
t = [T.(rand(1:10^5, N)) for _ in 1:1_000_000]
st=SArray{Tuple{N}}.(t)

matthias314 · October 22, 2024, 9:35pm

Here it is:

julia> @b f1($st, $x)
1.717 ms (3 allocs: 7.629 MiB)

julia> @b map(FixedVector{N}, st) findall(Fix1(in, $x), _)
902.658 μs (6 allocs: 122.531 KiB)

rocco_sprmnt21 · October 22, 2024, 9:46pm

Since f1 uses findfirst (assuming that only one value is being searched for, or that finding the first occurrence is enough), it should become more effective for vectors a little longer than 3.
Could you try N = 32?

matthias314 · October 22, 2024, 10:05pm

As in the previous post, just with N = 32:

julia> @b f1($st, $x)
14.917 ms (3 allocs: 7.629 MiB)

julia> @b map(FixedVector{N}, st) findall(Fix1(in, $x), _)
65.604 ms (7 allocs: 124.781 KiB)

julia> @b map(collect, st) findall(Fix1(in, $x), _)
26.547 ms (7 allocs: 124.781 KiB)

Now my version is much slower, even slower than findall with Vector. I don’t understand this because in is faster for FixedVector:

julia> w = st[1]; @b $x in $w
4.290 ns

julia> @b FixedVector{N}(w) $x in _
2.711 ns

julia> @b collect(w) $x in _
15.173 ns

Using findfirst looks slower:

julia> @b findfirst(==($x), $w) === nothing
13.547 ns

matthias314 · October 22, 2024, 10:21pm

With in instead of findfirst, f1 becomes even faster (N = 32):

julia> @b f1($st, $x)
15.493 ms (3 allocs: 7.629 MiB)

julia> @b f1_in($st, $x)
6.463 ms (3 allocs: 7.629 MiB)

julia> @b map(FixedVector{N}, st) f1_in(_, $x)
6.782 ms (3 allocs: 7.629 MiB)

where

function f1_in(a,e)
    r=Vector{Int}(undef,length(a))
    j=0
    for i in eachindex(a)
        if e in a[i]   # !isnothing(findfirst(==(e),a[i]))
            r[j+=1]=i
        end
    end
    resize!(r,j) 
end

matthias314 · October 22, 2024, 10:44pm

The problem seems to be findall. With N = 32, T = Int32, x = T(1) and

t = [T.(rand(1:10^5, N)) for _ in 1:1_000_000]
st = map(SVector{N}, t)

(as before), I get

julia> @b findall(Fix1(in, $x), $st)
68.203 ms (7 allocs: 124.781 KiB)

julia> @b [i for (i, w) in enumerate($st) if $x in w]
6.780 ms (7 allocs: 7.562 KiB)

EDIT: Also

julia> t2 = map(SmallVector{N}, t);
julia> @b [i for (i, w) in enumerate($t2) if $x in w]
7.595 ms (7 allocs: 7.562 KiB)
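
To check that the comprehension is a drop-in replacement for findall here, one could compare the two on plain Vectors. This is a sketch with assumed values: the seed and the smaller input size are arbitrary choices, and it only checks that the results agree, not how fast each version runs.

```julia
using Random

Random.seed!(1)          # arbitrary fixed seed, only for reproducibility
T = Int32
N = 32
t = [T.(rand(1:10^5, N)) for _ in 1:10_000]   # smaller than in the thread
x = T(1)

# Both expressions collect the indices of the inner vectors containing x.
via_findall = findall(v -> x in v, t)
via_comprehension = [i for (i, w) in enumerate(t) if x in w]
```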

rocco_sprmnt21 · October 23, 2024, 8:29am

It seems that some method of the function in is able to exploit a static array better than findfirst can.
It would be interesting to have documentation for functions like these, which at first glance seem to do the same thing (at least from a high-level, "logical" point of view), explaining which algorithm(s) they use in the various cases and when one is more “convenient” than the other.
A similar curiosity: the version using enumerate, which provides both the index and the value of each array element, is slower than the following version, which has only the index and must fetch the array element (here itself an array) each time.

julia> T = Int32
Int32

julia> N = 32
32

julia> t = [T.(rand(1:10^5, N)) for _ in 1:1_000_000];

julia> st=SArray{Tuple{N}}.(t);

julia> x=T(1)
1

julia> @b [i for (i, w) in enumerate($st) if $x in w]
16.229 ms (6 allocs: 7.609 KiB)

julia> @b [i for i in eachindex($st) if $x in $st[i]]
7.199 ms (6 allocs: 7.609 KiB)

matthias314 · October 23, 2024, 12:14pm

This is because the implementation of in for SVector can be vectorized while that of findfirst (the default method for AbstractArray) cannot. However, findfirst for FixedVector and SmallVector is vectorized (for suitable types), and in fact in for SmallVector is defined as

in(x, v::AbstractSmallVector) = findfirst(==(x), v) !== nothing
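
For plain numeric vectors without missing values, the three spellings below answer the same yes/no question, which is why one can be defined in terms of another. This is a small sketch using Base only; it says nothing about which one is faster for a given container type.

```julia
v = Int32[7, 3, 1, 9]
x = Int32(1)   # present in v
y = Int32(2)   # absent from v

# Three ways to ask "does v contain this value?"
contains_in(v, e)        = e in v
contains_findfirst(v, e) = findfirst(==(e), v) !== nothing
contains_any(v, e)       = any(==(e), v)

results_x = (contains_in(v, x), contains_findfirst(v, x), contains_any(v, x))
results_y = (contains_in(v, y), contains_findfirst(v, y), contains_any(v, y))
```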

I don’t find it surprising that enumerate is slower than eachindex. With Julia 1.11.0, the difference is quite small on my machine:

julia> @b [i for (i, w) in enumerate($st) if $x in w]
6.792 ms (7 allocs: 7.562 KiB)

julia> @b [i for i in eachindex($st) if $x in $st[i]]
6.035 ms (7 allocs: 7.562 KiB)
