Do I need a vector database to solve this problem?

Suppose I have a list of known vectors x_i \in \mathbb{R}^m for i = 1 \dots n.

Now my boss comes to me and gives me a vector y \in \mathbb{R}^m and says, find the i that minimizes

\| y - x_i \|.

Clearly, the best you can do is compute the norm for each i and find the smallest one, at computational cost O(nm).

Now, what changes if I am allowed to preprocess the x_i prior to my boss timing my code? In the m = 1 case, I could get clever and sort the (scalar) x_i values. Then I can find the optimal i using binary search, so (not counting the time to sort the list) the search takes only O(m \log n) time.

Is there a similar approach for the m-dimensional case?

I have heard whispers of something called a “vector database” that is optimized for these kinds of operations, but I suspect that anything with the word “database” in it is probably overkill for this simple (?) problem, and I’m wondering if there’s a straightforward data structure one could use.

(Note that you don’t get to know y at the preprocessing stage, only the x_i.)


Miss this community—haven’t been as active since changing jobs and doing almost everything in Python.

2 Likes

This is a problem where you get to pick 2 of fast, easy, and exact. The fast easy approximate solution is to pick k<m (think 3 or so) random vectors vₖ of length m and compute xᵢ⋅vₖ for each i,k. If ∥y−xᵢ∥ is small, then yᵢ⋅vₖ≈xᵢ⋅vₖ. You do the binary searches on xᵢ⋅vₖ with the key yᵢ⋅vₖ, and find the ones that are close for all three. Making this algorithm robust and good takes some effort.

2 Likes

If I’m following right, this is exactly the problem of nearest neighbor search (in m-dimensional Euclidean space). There’s a bunch of approximate and exact techniques to solve it (many at the Wikipedia link there).

3 Likes

I think you may want to look into space partitioning datastructures. Database technologies ALWAYS generate absurd hype. Whether they solve your problem or not I don’t know, but proceed with caution. Always benchmark, you will be surprised.

1 Like

If n and m are known in advance, I guess that O(n m) = O(1), though?

Eric pointed to the Wikipedia page which lists (presumably) all the known algorithms, however I’d like to point out that there’s a way of improving the naive (linear search) solution in a “Julian” manner: move the data known in advance into the type domain.

Furthermore, to maximally exploit CPU architecture parallelism, the “linear” search could be restructured as a binary tree (because n is known in advance).

Once all the data is in the type domain in an appropriate manner, and assuming the arithmetic (whatever represents \mathbb{R}) is some machine-native type like Float64, I think Julia could generate blazing fast code, especially if everything fits into the instruction cache.

But the move into the type domain necessitates a run time dispatch, so this can’t be efficient when n and m are small.

You might also preprocess your data.

The task is \arg\min_x \lVert y - x \rVert which is equivalent to \arg\min_x \lVert y - x \rVert^2.The squared norm expands to \lVert y \rVert^2 + \lVert x \rVert^2 - 2x^Ty. The optimization is done w.r.t. x, so \lVert y \rVert^2 is just an additive constant and the problem reduces to \arg\min_x \lVert x \rVert^2 - 2x^Ty.

If the list of vectors x_i is fixed, you can precompute the squared norms \lVert x_i \rVert^2 for each i=1,\ldots, n. When your boss comes :), you only have to compute the dot products. The complexity remains \theta(n m) but it is less computation and it can be parallelized easily.

Edit: One more tweak: \arg\min_x \lVert x \rVert^2 - 2x^Ty = \arg\min_x \frac{1}{2} \lVert x \rVert^2 - x^Ty. So you can precompute \frac{1}{2} \lVert x_i \rVert^2 and save n multiplications when the boss comes!

1 Like

Sounds like exactly the problem for GitHub - KristofferC/NearestNeighbors.jl: High performance nearest neighbor data structures and algorithms for Julia.

2 Likes

If I’m understanding this correctly, you have to compute n dot products of m-vectors, so this is still O(nm) complexity, right?

Cool!

Yes, the complexity is the same.