Perform a multi-column, multi-dimensional search quickly

I would like to perform a multi-column search from a multi-dimensional matrix like the one below, is there a fast way to do this? (In the example below, I want to get the rows next to -5.0 3.0 2.0)

Test_matrix =
[-5.0 4.0 2.0
-5.0 3.0 2.0
-5.0 3.0 3.0
-5.0 3.0 1.0
-4.0 3.0 2.0
-6.0 3.0 2.0
-5.0 3.0 2.0
5.0 -9.0 0.0
5.0 -3.0 0.0
5.0 -3.0 2.0]

In the example above, we want to get the columns 2 ,7,10.

This is why I wrote the following.
result_index_can = (Test_matrix[:,[1]]. == -5). *(Test_matrix[:,[2]]. == 3). *(Test_matrix[:,[3]]. == 2).

but in the real data, the speed of the search is a bottleneck due to the large matrix(over 1000000). Does anyone have a good solution for this?

One way of writing this is:

eachrow(Test_matrix) .== Ref([-5, 3, 2])

Another is:

findall(==([-5, 3, 2]), eachrow(Test_matrix))

NB: enclose your whole code inside triple backticks for proper display in discourse, which also enables the copy button

1 Like

Thank you for your beautiful codes!
But , index_can = eachrow(Test_matrix) .== Ref([-5, 3, 2])
is slower than
result_index_can = (Test_matrix[:,[1]]. == -5). *(Test_matrix[:,[2]]. == 3). *(Test_matrix[:,[3]]. == 2).

Is this missing interpolation, or the very small size of the MWE? I see:

julia> f1(x) = (x[:, 1] .== -5.0) .* (x[:, 2] .== 3.0) .* (x[:, 3] .== 2.0);

julia> f2(x) = eachrow(x) .== Ref([-5.0, 3.0, 2.0]);

julia> f3(x) = findall(==([-5.0, 3.0, 2.0]), eachrow(x));

julia> @btime f1($x);
  10.526 ms (10 allocations: 23.01 MiB)

julia> @btime f2($x);
  18.738 ms (8 allocations: 38.27 MiB)

julia> @btime f3($x);
  16.527 ms (1000008 allocations: 76.32 MiB)

consistent with what @Hikaru_Matsuoka says.

Btw is your example wrong? You say you want rows 2, 7, 10, but row 10 has 5 and -3 instead of -5 and 3? All the methods discussed so far wouldn’t return row 10 of Test_matrix

On my machine for the 1e6 example I find this to be an order of magnitude faster:

julia> function f4(x)
           res = Int64[]
           for i ∈ 1:size(x, 1)
               @inbounds if x[i, 1] == -5 && x[i, 2] == 3.0 && x[i, 3] == 2.0
                   push!(res, i)
               end
           end
           res
       end
f4 (generic function with 1 method)

julia> @btime f4($x);
  1.694 ms (6 allocations: 21.86 KiB)

Note that f3 and f4 above return only the integer indices of the matching rows (i.e. roughly a length 1,000,000/11^3 ≈ 751 vector for the example above with 11 randomly chosen integers from -5 to 5), while versions 1, and 2 give you a size(x, 1)-lenght vector of booleans.

To get such a vector with the speed of f4 you can do:

julia> function f5(x)
           res = fill(false, size(x, 1))
           for i ∈ 1:size(x, 1)
               @inbounds if x[i, 1] == -5 && x[i, 2] == 3.0 && x[i, 3] == 2.0
                   res[i] = true
               end
           end
           res
       end
f5 (generic function with 1 method)

julia> @btime f5($x);
  1.686 ms (2 allocations: 976.67 KiB)
2 Likes

Thanks @nilshg, using @benchmark without interpolation seems to work most of the time (similar results to @btime with interpolation). It works alright for my two expressions but not for OP’s…

Yeah I’ve never quite understood when you’re okay not to interpolate, so I just always do it (I always thought the issue was access to global variables, which I would have expected to affect all three expressions similarly, but clearly not!)

I should also add that while I tend to always sprinkle @inbounds into performance sensitive loops, this doesn’t actually seem to have any effect here - the compiler might be smart enough to eliminate the boundscheck for this simple loop.

And finally if one of the actual experts stops by and proposes some sort of @tturbo solution that runs in 50ns or something, maybe they could also explain why fill(false, size(x, 1) is two allocations?

Indeed, your code without loops is already fast.
As a side note, we normally use less brackets to index a matrix:

(x[:,1] .== -5) .& (x[:,2] .== 3) .& (x[:,3] .== 2)  

And here are some other simple options, using comprehensions, which are faster than your original code and only slightly slower than Nils’ solution above:

# outputs only matching integer indices:
[i for i in axes(x,1) if x[i,1]==-5 && x[i,2]==3 && x[i,3]==2]

# outputs all indices, as Vector{Bool} with 0 or 1:
[x[i,1] == -5 && x[i,2] == 3 && x[i,3] == 2 for i in axes(x,1)]
2 Likes

a slight variant of the solution of @rafael.guerra

[i for i in axes($x,1) if ($x[i,1]==-5 * $x[i,2]==3 * $x[i,3]==2)==1]